CN113723089A - Word segmentation model training method, word segmentation method, data processing method and data processing device - Google Patents

Word segmentation model training method, word segmentation method, data processing method and data processing device Download PDF

Info

Publication number
CN113723089A
CN113723089A (application CN202010448100.6A)
Authority
CN
China
Prior art keywords
word segmentation
data
entity
training
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010448100.6A
Other languages
Chinese (zh)
Other versions
CN113723089B (en)
Inventor
王潇斌
徐光伟
龙定坤
马春平
丁瑞雪
谢朋峻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202010448100.6A priority Critical patent/CN113723089B/en
Publication of CN113723089A publication Critical patent/CN113723089A/en
Application granted granted Critical
Publication of CN113723089B publication Critical patent/CN113723089B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a word segmentation model training method, a word segmentation method, a data processing method and a data processing device. The word segmentation model training method comprises the following steps: training with word segmentation labeling data to obtain a word segmentation model; acquiring entity tagging data, and adding word segmentation labels to the entity part and the non-entity part of the entity tagging data according to a preset rule; and training the word segmentation model with the entity tagging data to which the word segmentation labels have been added. The word segmentation model can thereby fit the word boundary rules in the entity tagging data, so that its segmentation boundaries are ultimately consistent with those of an entity tagging model trained on the same entity tagging data, which avoids word segmentation boundary conflicts when the two models are used together.

Description

Word segmentation model training method, word segmentation method, data processing method and data processing device
Technical Field
The invention relates to the technical field of text processing, in particular to a word segmentation model training method, a word segmentation method, a data processing method and a data processing device.
Background
The word segmentation model and the entity tagging model are generally sequence tagging models based on character granularity: the word segmentation model is built by training on a large amount of word segmentation data, and the entity tagging model is built by training on a large amount of entity tagging data. When the two are used together, their boundaries may conflict. For example, for the sentence "驻京办表示" ("the Beijing liaison office stated"), the word segmentation result is "驻京办 表示", while the entity annotation result is "驻京/LOC 办 表示" (where "/" is the entity annotation format: the "京" immediately before the "/" marks the end of an entity word and the "LOC" after the "/" is its label), so the entity boundary after "京" conflicts with the word segmentation boundary after "办".
Disclosure of Invention
In view of the above, the present invention has been made to provide a segmentation model training method, a segmentation method, and a data processing method and apparatus that overcome or at least partially solve the above problems.
In a first aspect, an embodiment of the present invention provides a method for training a segmentation model, including:
training by using word segmentation labeling data to obtain a word segmentation model;
acquiring entity tagging data, and adding word segmentation labels to an entity part and a non-entity part of the entity tagging data respectively according to a preset rule;
and training the word segmentation model by using the entity labeling data after adding the word segmentation labels.
In some optional embodiments, the training using the segmentation labeling data to obtain the segmentation model specifically includes:
using word segmentation labeling data, and training by adopting a conditional random field CRF loss function to obtain a word segmentation model;
correspondingly, the training of the word segmentation model by using the entity labeling data after adding the word segmentation label specifically comprises:
and training the word segmentation model by using the entity marking data after adding the word segmentation labels and adopting a CRF loss function.
In some optional embodiments, the training with the conditional random field CRF loss function to obtain the word segmentation model specifically includes:
selecting first label data in the word segmentation label data, and generating a determined label sequence and a possible label sequence combination corresponding to the first label data;
determining a first joint probability of the first annotation data and the determined tag sequence, and a second joint probability of the first annotation data and each possible tag sequence in a possible tag sequence combination;
according to the first joint probability and the second joint probability, training a first normative parameter in a first objective function constructed according to a CRF loss function by adopting a random gradient descent training method;
and stopping training if the descending amplitude of the value of the first objective function is lower than a preset first descending threshold value.
In some optional embodiments, the generating of the determined tag sequence and the possible tag sequence combination corresponding to the first annotation data specifically includes:
generating a determined BIES label sequence corresponding to the first labeling data according to the word segmentation condition of the first labeling data;
and determining the possible BIES label of each word according to the position of each word in the first annotation data, and generating the possible BIES label sequence combination corresponding to the first annotation data according to the possible BIES label of each word.
In some optional embodiments, the training of the segmentation model by using the entity labeling data after adding the segmentation labels and using a CRF loss function specifically includes:
selecting second labeling data in the entity labeling data after the word segmentation labels are added, and generating a determined label sequence combination and a possible label sequence combination corresponding to the second labeling data;
determining a third joint probability of the second labeling data and each determined label sequence in the determined label sequence combination respectively, and determining a fourth joint probability of the second labeling data and each possible label sequence in the possible label sequence combination respectively;
training a second normative parameter in a second objective function constructed according to the CRF loss function by adopting a random gradient descent training method according to the third joint probability and the fourth joint probability;
and if the descending amplitude of the value of the second objective function is lower than a preset second descending threshold value, stopping training.
In some optional embodiments, the generating of the determined tag sequence combination corresponding to the second annotation data specifically includes:
determining a determined BIES label of each character in the entity tagging participle in the second tagging data;
determining a possible BIES label of each word according to the position of each word of the non-entity part in the second labeling data relative to the adjacent entity labeling participle;
and generating a determined BIES label sequence combination corresponding to the second labeling data according to the determined BIES label and the possible BIES label.
In some optional embodiments, the determining, according to a position of each word of the non-entity part in the second annotation data relative to the adjacent entity annotation participle, a possible BIES label of each word specifically includes:
adding (S, E) labels to the characters of the first non-entity part on the left side of the entity tagging participle in the second tagging data;
and adding (S, B) labels to the characters of the first non-entity part on the right side of the entity tagging participle in the second tagging data.
In some optional embodiments, generating a possible tag sequence combination corresponding to the second annotation data specifically includes:
determining the possible BIES label of each word according to the position of each word in the second annotation data;
and generating a possible BIES label sequence combination corresponding to the second labeling data according to the possible BIES label of each word.
In a second aspect, an embodiment of the present invention provides a word segmentation method, including:
and performing word segmentation on the target text by using the word segmentation model trained by the word segmentation model training method to obtain a word segmentation result.
In a third aspect, an embodiment of the present invention provides a data processing method, including:
performing word segmentation on the target text by using the word segmentation model trained according to the word segmentation model training method to obtain a word segmentation text;
labeling the target text by using an entity labeling model to obtain an entity labeling text, wherein the entity labeling model is trained by using the entity labeling data in advance;
judging whether the boundary of the labeled participle is consistent with the corresponding participle boundary in the participle text or not aiming at each labeled participle in the entity labeled text;
if yes, marking the participles in the participle text according to the marking information of the marked participles.
In a fourth aspect, an embodiment of the present invention provides a training apparatus for a segmentation model, including:
the first training module is used for training by using word segmentation labeling data to obtain a word segmentation model;
the second training module is used for obtaining entity marking data and adding word segmentation labels to the entity part and the non-entity part of the entity marking data according to a preset rule; and training the word segmentation model by using the entity labeling data after adding the word segmentation labels.
In a fifth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which computer instructions are stored, which, when executed by a processor, implement the above-mentioned word segmentation model training method, or implement the above-mentioned word segmentation method, or implement the above-mentioned data processing method.
In a sixth aspect, an embodiment of the present invention provides a server, including a memory, a processor, and a computer program which is stored on the memory and can run on the processor, wherein when the processor executes the program, the above word segmentation method is realized, or the above data processing method is realized.
The technical solutions provided by the embodiments of the invention have at least the following beneficial effects:
(1) the word segmentation model training method provided by the embodiment of the invention obtains the word segmentation model by using word segmentation data training, and further trains and adjusts the word segmentation model by using the word segmentation condition of the entity tagging word in the entity tagging data, so that the word segmentation model can fit the entity tagging word boundary rule in the entity tagging data, the word segmentation boundaries of the word segmentation model and the entity tagging model are finally consistent, and the possibility of word segmentation boundary conflict caused by the simultaneous use of the word segmentation model and the entity tagging model is avoided.
(2) The word segmentation model training method provided by the embodiments of the invention further trains and adjusts the word segmentation model using the entity-annotated word boundaries in the entity labeling data, rather than ensuring boundary consistency at the corpus level by performing word segmentation annotation and entity annotation on the same corpus simultaneously. The workload at the corpus level is therefore reduced, and the training cost of the word segmentation model is reduced.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart of a training method of a segmentation model according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a specific implementation of a training method for a segmentation model according to a second embodiment of the present invention;
fig. 3 is a flowchart illustrating a specific implementation of generating a tag sequence combination corresponding to second annotation data according to a second embodiment of the present invention;
FIG. 4 is a flowchart of a data processing method according to a third embodiment of the present invention;
fig. 5 is a schematic structural diagram of a word segmentation model training device in an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In order to solve the problem of boundary conflict existing in the prior art when performing word segmentation and entity labeling on a text at the same time, embodiments of the present invention provide a word segmentation model training method, a word segmentation method, a data processing method, and a device, so as to avoid word segmentation boundary conflict when performing word segmentation and entity labeling on a target text at the same time.
Example one
The embodiment of the invention provides a training method of a word segmentation model, the flow of which is shown in figure 1, and the method comprises the following steps:
step S11: and training by using word segmentation labeling data to obtain a word segmentation model.
In one embodiment, the segmentation model may be obtained by training with a Conditional Random Field (CRF) loss function using segmentation labeling data.
Specifically, first annotation data in the segmentation annotation data is selected, and a determined tag sequence and a possible tag sequence combination corresponding to the first annotation data are generated; a first joint probability of the first annotation data and the determined tag sequence is determined, as well as a second joint probability of the first annotation data and each possible tag sequence in the possible tag sequence combination; a first normative parameter in a first objective function constructed according to the CRF loss function is trained by a random gradient descent training method according to the first joint probability and the second joint probability; and when the descending amplitude of the value of the first objective function is lower than a preset first descending threshold, training is stopped.
Specifically, the tag sequence may be a BIES sequence, where B (begin) marks the first character of a word, I (inside) marks a middle character of a word, E (end) marks the last character of a word, and S (single) marks a character that forms a word on its own.
Specifically, the first tagging data in the segmentation tagging data may be the segmentation information of each sample in the segmentation tagging data; the determined BIES tag of each character in a sample can be derived from this segmentation information, yielding the determined BIES tag sequence of the sample, and the possible BIES tag sequence combination of the sample is generated from the possible BIES tags of each character in the sample.
Taking "我在北京市" ("I am in Beijing City") as an example, the determined BIES tag sequence generated from its word segmentation information is {(S), (S), (B), (I), (E)}. Since the segmentation information is deterministic, the determined BIES tag sequence is unique. If, instead, the word segmentation information is ignored, the character "我" is at the beginning, so its possible tag is only B or S; the characters "在", "北" and "京" are in middle positions, so each may take any of B, I, E and S; and the character "市" is at the end, so its possible tag is only E or S. The resulting possible BIES tag sequence combination is therefore {(B, S), (B, I, E, S), (B, I, E, S), (B, I, E, S), (E, S)}, which contains 256 (2 × 4 × 4 × 4 × 2) possible BIES tag sequences.
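To make this step concrete, the following minimal Python sketch (illustrative only: the function names, the word-list input format and the printed values are assumptions of this sketch, not details taken from the patent) builds the determined BIES tag sequence of a segmented sample and the position-constrained candidate tags from which the possible tag sequences are enumerated:

```python
from itertools import product

def determined_tags(segmented_sample):
    """Determined BIES tags for a sample given as a list of words, e.g. ["我", "在", "北京市"]."""
    tags = []
    for word in segmented_sample:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags

def possible_tags(sentence):
    """Position-constrained candidate tags for each character, ignoring the segmentation."""
    last = len(sentence) - 1
    candidates = []
    for i in range(len(sentence)):
        if i == 0 and i == last:
            candidates.append(("S",))                # single-character sample
        elif i == 0:
            candidates.append(("B", "S"))            # first character
        elif i == last:
            candidates.append(("E", "S"))            # last character
        else:
            candidates.append(("B", "I", "E", "S"))  # middle character
    return candidates

sample = ["我", "在", "北京市"]                       # segmentation of "我在北京市"
print(determined_tags(sample))                       # ['S', 'S', 'B', 'I', 'E']
cands = possible_tags("".join(sample))               # [('B','S'), ('B','I','E','S'), ..., ('E','S')]
print(len(list(product(*cands))))                    # 256 possible BIES tag sequences
```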
Step S12: and acquiring entity labeling data, and adding word segmentation labels to the entity part and the non-entity part of the entity labeling data respectively according to a preset rule.
For each sample in the entity labeling data, the determined label of each character of the entity part is derived from the word segmentation of the entity part; possible labels are determined for each character of the non-entity part; and the determined labels of the entity part together with the possible labels of the non-entity part form the determined tag sequence combination of the sample data. In addition, ignoring the word segmentation within the sample, the possible label of each character is determined directly, yielding the possible tag sequence combination of the sample data.
Specifically, the tag sequence may be a BIES sequence.
Step S13: and training the word segmentation model by using the entity labeling data after adding the word segmentation labels.
If the segmentation model was obtained in step S11 by training with the conditional random field CRF loss function, then the entity tagging data with the added word segmentation labels is used, together with the CRF loss function, to train and adjust the segmentation model. Optionally, the initial training and the training adjustment of the word segmentation model may also adopt other methods; the specific method is not limited in this embodiment, as long as both stages use the same method.
The word segmentation model training method provided by the embodiment of the invention obtains the word segmentation model by using word segmentation data training, and further trains and adjusts the word segmentation model by using the word segmentation condition of the entity tagging word in the entity tagging data, so that the word segmentation model can fit the entity tagging word boundary rule in the entity tagging data, the word segmentation boundaries of the word segmentation model and the entity tagging model are finally consistent, and the possibility of word segmentation boundary conflict caused when the word segmentation model and the entity tagging model are used simultaneously is avoided.
According to the word segmentation model training method provided by the embodiments of the invention, the word segmentation model is further trained and adjusted using the entity-annotated word boundaries in the entity tagging data, rather than ensuring boundary consistency at the corpus level by performing word segmentation annotation and entity annotation on the same corpus simultaneously. The workload at the corpus level is therefore reduced, and the training cost of the word segmentation model is reduced.
Example two
The second embodiment of the present invention provides a specific implementation of a word segmentation model training method, the flow of which is shown in fig. 2, and the method includes the following steps:
step S21: and selecting first label data in the word segmentation label data, and generating a determined label sequence and a possible label sequence combination corresponding to the first label data.
In one embodiment, the method may include: generating the determined BIES tag sequence according to the word segmentation of the first annotation data; and determining the possible BIES labels of each character according to the position of each character in the first annotation data, then generating the possible BIES tag sequence combination of the first annotation data from the possible BIES labels of each character.
That is, according to the word segmentation of the first annotation data, a unique BIES label can be determined for each character, giving the unique determined BIES label sequence corresponding to the first annotation data; ignoring the word segmentation, all possible BIES labels of each character are determined only from the position (beginning, end or middle) of the character in the first annotation data, and the possible BIES label sequence combination of the first annotation data is generated from all possible BIES labels of each character.
Step S22: a first joint probability of the first labeling data and the determined tag sequence is determined, and a second joint probability of the first labeling data and each possible tag sequence in the possible tag sequence combination is determined.
Step S23: and training a first specification parameter in a first objective function constructed according to the CRF loss function by adopting a random gradient descent training method according to the first joint probability and the second joint probability.
Specifically, the following function may be used as the first objective function, where w is the first specification parameter. In this function, x_i denotes the first annotation data of the i-th word segmentation sample in the segmentation annotation data; y_i denotes the determined tag sequence corresponding to the i-th piece of first annotation data; ŷ_i^(j) denotes the j-th possible tag sequence corresponding to the i-th piece of first annotation data; f(x_i, y_i) denotes the first joint probability of the i-th piece of first annotation data and the determined tag sequence y_i; and f(x_i, ŷ_i^(j)) denotes the second joint probability of the i-th piece of first annotation data and the j-th possible tag sequence ŷ_i^(j).
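The formula itself appears only as an image in the original publication. Assuming it is the standard CRF negative log-likelihood taken over the position-constrained candidate set described above (an inference from the surrounding definitions, not a verbatim copy of the filed formula), a consistent form is:

\[
L_1(w) = -\sum_i \log \frac{f(x_i, y_i)}{\sum_j f\big(x_i, \hat{y}_i^{(j)}\big)}
\]

where the joint probabilities f(·,·) are parameterized by w, the numerator scores the determined tag sequence, and the denominator normalizes over every possible tag sequence in the combination, so minimizing L_1(w) pushes probability mass toward the annotated segmentation.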
Take x_1 = "我在北京市" ("I am in Beijing City") and y_1 = {(S), (S), (B), (I), (E)} as an example. Then ŷ_1^(j) is any sequence in the possible tag sequence combination {(B, S), (B, I, E, S), (B, I, E, S), (B, I, E, S), (E, S)}: the character "我" is at the beginning, so its tag can only be B or S; the characters "在", "北" and "京" are in middle positions, so each tag may be any of B, I, E or S; and the character "市" is at the end, so its tag can only be E or S.
Step S24: and judging whether the descending amplitude of the value of the first objective function is lower than a preset first descending threshold value or not.
If the determination at step S24 is no, the process continues to step S23 until the determination at step S24 is yes, and step S25 is performed.
Since the drop in the value of the first objective function cannot be assessed after it has been computed only once, the determination at step S24 defaults to "no" after step S23 is performed for the first time, and step S23 continues.
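The S23-S25 loop can be sketched in Python as follows (a minimal illustration; the objective-and-gradient routine, the parameter layout and all names are placeholders rather than the patented implementation):

```python
def sgd_until_drop_below(objective_and_grad, w, data, lr=0.1, drop_threshold=1e-4):
    """Run gradient descent until the decrease of the objective value between
    successive evaluations is below drop_threshold (the descending threshold)."""
    prev_value = None
    while True:
        value, grad = objective_and_grad(w, data)       # step S23: evaluate objective and gradient
        if prev_value is not None and prev_value - value < drop_threshold:
            return w                                    # step S24 answers 'yes': stop (step S25 follows)
        prev_value = value                              # after the first evaluation, S24 defaults to 'no'
        w = [wi - lr * gi for wi, gi in zip(w, grad)]   # gradient descent update of the parameter w
```

The same loop, given the second objective function and the second descending threshold, also covers steps S27-S29 below.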
Step S25: and selecting second labeling data in the entity labeling data after the word segmentation labels are added, and generating a determined label sequence combination and a possible label sequence combination corresponding to the second labeling data.
Specifically, referring to fig. 3, the determined tag sequence combination corresponding to the second labeled data may be generated as follows:
step S31: and determining BIES labels of all characters in the entity tagging participles in the second tagging data.
Taking "我在大学(ORG)工作" ("I work at a university") as an example, the entity-annotated word is "大学" (university): the determined label of "大" is B and the determined label of "学" is E.
Step S32: and determining the possible BIES label of each word according to the position of each word of the non-entity part in the second labeling data relative to the adjacent entity labeling participle.
In one embodiment, an (S, E) tag is added to the character of the first non-entity part to the left of an entity-annotated word in the second annotation data, and an (S, B) tag is added to the character of the first non-entity part to the right of the entity-annotated word.
Taking "我在大学(ORG)工作" as an example, "在" is the character of the first non-entity part to the left of the entity-annotated word "大学", so an (S, E) tag is added; "工" is the character of the first non-entity part to the right of "大学", so an (S, B) tag is added.
Meanwhile, the character "我" is at the beginning, so its tag is (B, S); the character "作" is at the end, so its tag is (E, S).
Step S33: and generating a determined BIES label sequence combination corresponding to the second labeling data according to the determined BIES label and the possible BIES label.
Taking "我在大学(ORG)工作" as an example, the finally generated determined tag sequence combination is {(B, S), (S, E), (B), (E), (S, B), (E, S)}; a determined tag sequence is any single sequence drawn from this combination.
For a character in the entity annotation sample data that is neither at the beginning nor at the end and is not adjacent to an entity-annotated word, its candidate tags in this combination are (B, I, E, S).
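Steps S31-S33 can be illustrated with the following Python sketch (the character-string input and the (start, end) entity-span format are assumptions made for this example, not a format prescribed by the patent):

```python
def constrained_candidates(sentence, entity_spans):
    """Per-character tag candidates for an entity-annotated sample (steps S31-S33).
    `sentence` is the character string; `entity_spans` lists (start, end) index
    pairs (end exclusive) of the entity-annotated words."""
    n = len(sentence)
    cands = [None] * n
    starts = {s for s, _ in entity_spans}
    ends = {e for _, e in entity_spans}              # index just after each entity

    # Entity characters receive their determined BIES tags from the entity boundaries.
    for s, e in entity_spans:
        if e - s == 1:
            cands[s] = ("S",)
        else:
            cands[s] = ("B",)
            for i in range(s + 1, e - 1):
                cands[i] = ("I",)
            cands[e - 1] = ("E",)

    # Non-entity characters receive position-constrained possible tags.
    for i in range(n):
        if cands[i] is not None:
            continue
        if i + 1 in starts:
            cands[i] = ("S", "E")                    # first non-entity character left of an entity
        elif i in ends:
            cands[i] = ("S", "B")                    # first non-entity character right of an entity
        elif i == 0:
            cands[i] = ("B", "S")                    # sample-initial character
        elif i == n - 1:
            cands[i] = ("E", "S")                    # sample-final character
        else:
            cands[i] = ("B", "I", "E", "S")          # any other non-entity character
    return cands

# "我在大学工作" with "大学" annotated as an ORG entity at character indices (2, 4):
print(constrained_candidates("我在大学工作", [(2, 4)]))
# [('B', 'S'), ('S', 'E'), ('B',), ('E',), ('S', 'B'), ('E', 'S')]
```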
Correspondingly, the possible tag sequence combinations corresponding to the second labeled data can be generated as follows:
determining the possible BIES label of each word according to the position of each word in the second annotation data; and generating a possible BIES label sequence combination corresponding to the second labeling data according to the possible BIES label of each word.
Specifically, for the first word of the second annotation data, the possible label is B or S; for the last word of the second annotation data, the possible label is E or S; for the intermediate words in the second annotation data that are not beginning or end, their possible labels are (B, I, E, S).
Taking "我在大学(ORG)工作" as an example, the character "我" is at the beginning of the second annotation data, so its possible label is B or S; the character "作" is at the end of the second annotation data, so its possible label is E or S; and the labels of the four intermediate characters are each (B, I, E, S). The possible tag sequence combination for "我在大学(ORG)工作" is therefore {(B, S), (B, I, E, S), (B, I, E, S), (B, I, E, S), (B, I, E, S), (E, S)}.
Step S26: and determining a third joint probability of the second labeling data and each determined label sequence in the determined label sequence combination respectively, and determining a fourth joint probability of the second labeling data and each possible label sequence in the possible label sequence combination respectively.
Step S27: and training a second specification parameter in a second objective function constructed according to the CRF loss function by adopting a random gradient descent training method according to the third joint probability and the fourth joint probability.
Specifically, the following function may be used as the second objective function, where w′ is the second specification parameter. In this function, x′_i denotes the i-th piece of second annotation data in the entity annotation data; y′_i^(m) denotes the m-th determined tag sequence corresponding to the i-th piece of second annotation data; ŷ′_i^(n) denotes the n-th possible tag sequence corresponding to the i-th piece of second annotation data; f(x′_i, y′_i^(m)) denotes the third joint probability of the i-th piece of second annotation data and the m-th determined tag sequence; and f(x′_i, ŷ′_i^(n)) denotes the fourth joint probability of the i-th piece of second annotation data and the n-th possible tag sequence.
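As with the first objective, the formula itself appears only as an image in the original publication. Assuming it is the usual partially-annotated (marginalized) CRF loss, a form consistent with the third and fourth joint probabilities defined above would be:

\[
L_2(w') = -\sum_i \log \frac{\sum_m f\big(x'_i, y_i'^{(m)}\big)}{\sum_n f\big(x'_i, \hat{y}_i'^{(n)}\big)}
\]

where the joint probabilities f(·,·) are parameterized by w′, the numerator marginalizes over every tag sequence compatible with the entity-derived constraints, and the denominator normalizes over every position-constrained possible tag sequence. This is a reconstruction under that assumption, not the verbatim filed formula.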
Step S28: and judging whether the descending amplitude of the value of the second objective function is lower than a preset second descending threshold value or not.
If the determination at step S28 is no, the process continues to step S27 until the determination at step S28 is yes, and step S29 is performed.
Since the drop in the value of the second objective function cannot be assessed after it has been computed only once, the determination at step S28 defaults to "no" after step S27 is performed for the first time, and step S27 continues.
Specifically, the second drop threshold may be the same as or different from the first drop threshold in step S24.
Step S29: and stopping training the word segmentation model.
When the determination in step S28 is yes, the training of the adjustment of the segmentation model is stopped.
Based on the inventive concept of the present invention, an embodiment of the present invention further provides a word segmentation method, including performing word segmentation on a target text by using a word segmentation model trained according to the word segmentation model training method described above, so as to obtain a word segmentation result.
EXAMPLE III
An embodiment of the present invention provides a data processing method, a flow of which is shown in fig. 4, and the method includes the following steps:
step S41: and performing word segmentation on the target text by using a word segmentation model to obtain a word segmentation text.
Specifically, the word segmentation model is trained according to the word segmentation model training method described in the first embodiment or the second embodiment.
Step S42: and labeling the target text by using the entity labeling model to obtain an entity labeling text.
Specifically, the entity labeling model is trained in advance by using the entity labeling data described in the first embodiment.
Steps S41 and S42 need not be performed in a particular order: either may be executed first, or both may be executed simultaneously. The target text operated on in step S41 and in step S42 is the same target text; the acquired target text may be copied so that step S41 and step S42 each operate on a copy.
Step S43: And judging, for each labeled participle in the entity labeled text, whether the boundary of the labeled participle is consistent with the corresponding participle boundary in the participle text.
If yes, go to step S44.
Step S44: and marking the corresponding participles in the participle text according to the marking information of the marked participles.
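Steps S43-S44 amount to transferring an entity label onto the segmented text only when both of its boundaries coincide with word segmentation boundaries. A minimal Python sketch follows (the function name and the character-offset span format are assumptions of this illustration, not prescribed by the patent):

```python
def transfer_entity_labels(seg_words, entity_spans):
    """seg_words: segmented target text as a list of words;
    entity_spans: (start, end, label) character spans from the entity labeling model."""
    boundaries, pos = {0}, 0
    for word in seg_words:                            # character offsets of segmentation boundaries
        pos += len(word)
        boundaries.add(pos)

    kept = []
    for start, end, label in entity_spans:
        # Step S43: the labeled participle is transferred only if both of its
        # boundaries coincide with word segmentation boundaries.
        if start in boundaries and end in boundaries:
            kept.append((start, end, label))          # step S44: mark the corresponding participle(s)
    return kept

seg = ["我", "在", "大学", "工作"]                     # segmentation of "我在大学工作"
ents = [(2, 4, "ORG")]                                # entity labeling output: "大学" as ORG
print(transfer_entity_labels(seg, ents))              # [(2, 4, 'ORG')] - boundaries are consistent
```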
Based on the inventive concept of the present invention, an embodiment of the present invention further provides a training apparatus for a segmentation model, which has a structure as shown in fig. 5, and includes:
a first training module 51, configured to train to obtain a segmentation model by using segmentation labeling data;
the second training module 52 is configured to obtain entity tagging data, and add word segmentation labels to an entity part and a non-entity part of the entity tagging data according to a preset rule; and training the word segmentation model by using the entity labeling data after adding the word segmentation labels.
In some optional embodiments, the first training module 51 obtains a segmentation model by training using segmentation labeling data, and is specifically configured to:
using word segmentation labeling data, and training by adopting a conditional random field CRF loss function to obtain a word segmentation model; correspondingly, the second training module 52 trains the word segmentation model by using the entity tagging data after adding the word segmentation label, and is specifically configured to:
and training the word segmentation model by using the entity marking data after adding the word segmentation labels and adopting a CRF loss function.
In some optional embodiments, the first training module 51 obtains the word segmentation model by training using a conditional random field CRF loss function, and is specifically configured to:
selecting first label data in the word segmentation label data, and generating a determined label sequence and a possible label sequence combination corresponding to the first label data; determining a first joint probability of the first annotation data and the determined tag sequence, and a second joint probability of the first annotation data and each possible tag sequence in a possible tag sequence combination; according to the first joint probability and the second joint probability, training a first normative parameter in a first objective function constructed according to a CRF loss function by adopting a random gradient descent training method; and stopping training if the descending amplitude of the value of the first objective function is lower than a preset first descending threshold value.
In some optional embodiments, the first training module 51 generates a determined tag sequence and a possible tag sequence combination corresponding to the first annotation data, and is specifically configured to:
generating a determined BIES label sequence corresponding to the first labeling data according to the word segmentation condition of the first labeling data; and determining the possible BIES label of each word according to the position of each word in the first annotation data, and generating the possible BIES label sequence combination corresponding to the first annotation data according to the possible BIES label of each word.
In some optional embodiments, the second training module 52 trains the word segmentation model by using the entity labeling data after adding the word segmentation label and using a CRF loss function, specifically to:
selecting second labeling data in the entity labeling data after the word segmentation labels are added, and generating a determined label sequence combination and a possible label sequence combination corresponding to the second labeling data; determining a third joint probability of the second labeling data and each determined label sequence in the determined label sequence combination respectively, and determining a fourth joint probability of the second labeling data and each possible label sequence in the possible label sequence combination respectively; training a second normative parameter in a second objective function constructed according to the CRF loss function by adopting a random gradient descent training method according to the third joint probability and the fourth joint probability; and if the descending amplitude of the value of the second objective function is lower than a preset second descending threshold value, stopping training.
In some optional embodiments, the second training module 52 generates a determined tag sequence combination corresponding to the second annotation data, and is specifically configured to:
determining a determined BIES label of each character in the entity tagging participle in the second tagging data; determining a possible BIES label of each word according to the position of each word of the non-entity part in the second labeling data relative to the adjacent entity labeling participle; and generating a determined BIES label sequence combination corresponding to the second labeling data according to the determined BIES label and the possible BIES label.
In some optional embodiments, the second training module 52 determines, according to a position of each word of the non-entity part in the second annotation data relative to the adjacent entity annotation participle, a possible BIES label for each word, specifically for:
adding (S, E) labels to the characters of the first non-entity part on the left side of the entity tagging participle in the second tagging data; and adding (S, B) labels to the characters of the first non-entity part on the right side of the entity tagging participle in the second tagging data.
In some optional embodiments, the second training module 52 generates a possible tag sequence combination corresponding to the second annotation data, and is specifically configured to:
determining the possible BIES label of each word according to the position of each word in the second annotation data; and generating a possible BIES label sequence combination corresponding to the second labeling data according to the possible BIES label of each word.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Based on the inventive concept of the present invention, an embodiment of the present invention further provides a computer-readable storage medium, on which computer instructions are stored, and when the instructions are executed by a processor, the method for training a word segmentation model described above is implemented, or the method for word segmentation is implemented, or the method for data processing is implemented.
Based on the same inventive concept, an embodiment of the present invention further provides a server, including a memory, a processor, and a computer program which is stored on the memory and can run on the processor, wherein when the processor executes the program, the above word segmentation method is realized, or the above data processing method is realized.
Unless specifically stated otherwise, terms such as processing, computing, calculating, determining, displaying, or the like, may refer to an action and/or process of one or more processing or computing systems or similar devices that manipulates and transforms data represented as physical (e.g., electronic) quantities within the processing system's registers and memories into other data similarly represented as physical quantities within the processing system's memories, registers or other such information storage, transmission or display devices. Information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. Of course, the processor and the storage medium may reside as discrete components in a user terminal.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in memory units and executed by processors. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or". The terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

Claims (12)

1. A method for training a segmentation model is characterized by comprising the following steps:
training by using word segmentation labeling data to obtain a word segmentation model;
acquiring entity tagging data, and adding word segmentation labels to an entity part and a non-entity part of the entity tagging data respectively according to a preset rule;
and training the word segmentation model by using the entity labeling data after adding the word segmentation labels.
2. The method of claim 1, wherein the training using the segmentation annotation data to obtain the segmentation model specifically comprises:
using word segmentation labeling data, and training by adopting a conditional random field CRF loss function to obtain a word segmentation model;
correspondingly, the training of the word segmentation model by using the entity labeling data after adding the word segmentation label specifically comprises:
and training the word segmentation model by using the entity marking data after adding the word segmentation labels and adopting a CRF loss function.
3. The method of claim 2, wherein the training with the conditional random field CRF loss function to obtain the segmentation model comprises:
selecting first label data in the word segmentation label data, and generating a determined label sequence and a possible label sequence combination corresponding to the first label data;
determining a first joint probability of the first annotation data and the determined tag sequence, and a second joint probability of the first annotation data and each possible tag sequence in a possible tag sequence combination;
according to the first joint probability and the second joint probability, training a first normative parameter in a first objective function constructed according to a CRF loss function by adopting a random gradient descent training method;
and stopping training if the descending amplitude of the value of the first objective function is lower than a preset first descending threshold value.
4. The method of claim 3, wherein the generating of the determined tag sequence and possible tag sequence combinations corresponding to the first annotation data comprises:
generating a determined BIES label sequence corresponding to the first labeling data according to the word segmentation condition of the first labeling data;
and determining the possible BIES label of each word according to the position of each word in the first annotation data, and generating the possible BIES label sequence combination corresponding to the first annotation data according to the possible BIES label of each word.
5. The method of claim 2, wherein the training of the segmentation model using the entity labeling data after adding the segmentation labels using a CRF loss function comprises:
selecting second labeling data in the entity labeling data after the word segmentation labels are added, and generating a determined label sequence combination and a possible label sequence combination corresponding to the second labeling data;
determining a third joint probability of the second labeling data and each determined label sequence in the determined label sequence combination respectively, and determining a fourth joint probability of the second labeling data and each possible label sequence in the possible label sequence combination respectively;
training a second normative parameter in a second objective function constructed according to the CRF loss function by adopting a random gradient descent training method according to the third joint probability and the fourth joint probability;
and if the descending amplitude of the value of the second objective function is lower than a preset second descending threshold value, stopping training.
6. The method of claim 5, wherein the generating of the determined tag sequence combination corresponding to the second annotation data specifically comprises:
determining a determined BIES label of each character in the entity tagging participle in the second tagging data;
determining a possible BIES label of each word according to the position of each word of the non-entity part in the second labeling data relative to the adjacent entity labeling participle;
and generating a determined BIES label sequence combination corresponding to the second labeling data according to the determined BIES label and the possible BIES label.
7. The method of claim 6, wherein determining the possible BIES label for each word based on the position of each word relative to the adjacent entity labeled participles for the non-entity portion in the second label data comprises:
adding (S, E) labels to the characters of the first non-entity part on the left side of the entity tagging participle in the second tagging data;
and adding (S, B) labels to the characters of the first non-entity part on the right side of the entity tagging participle in the second tagging data.
8. The method of claim 5, wherein generating the possible tag sequence combinations corresponding to the second annotation data comprises:
determining the possible BIES label of each word according to the position of each word in the second annotation data;
and generating a possible BIES label sequence combination corresponding to the second labeling data according to the possible BIES label of each word.
9. A method of word segmentation, comprising:
performing word segmentation on the target text by using the word segmentation model trained according to the word segmentation model training method of any one of claims 1 to 8 to obtain a word segmentation result.
10. A data processing method, comprising:
performing word segmentation on a target text by using a word segmentation model trained according to the word segmentation model training method of any one of claims 1 to 8 to obtain a word segmentation text;
labeling the target text by using an entity labeling model to obtain an entity labeling text, wherein the entity labeling model is trained by using the entity labeling data of any one of claims 1 to 8 in advance;
judging whether the boundary of the labeled participle is consistent with the corresponding participle boundary in the participle text or not aiming at each labeled participle in the entity labeled text;
if yes, marking the participles in the participle text according to the marking information of the marked participles.
11. A word segmentation model training device, comprising:
the first training module is used for training by using word segmentation labeling data to obtain a word segmentation model;
the second training module is used for obtaining entity marking data and adding word segmentation labels to the entity part and the non-entity part of the entity marking data according to a preset rule; and training the word segmentation model by using the entity labeling data after adding the word segmentation labels.
12. A computer readable storage medium having stored thereon computer instructions, which when executed by a processor, implement the method of training a segmentation model according to any one of claims 1 to 8, or implement the method according to claim 9 or 10.
CN202010448100.6A 2020-05-25 2020-05-25 Word segmentation model training method, word segmentation method and data processing method and device Active CN113723089B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010448100.6A CN113723089B (en) 2020-05-25 2020-05-25 Word segmentation model training method, word segmentation method and data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010448100.6A CN113723089B (en) 2020-05-25 2020-05-25 Word segmentation model training method, word segmentation method and data processing method and device

Publications (2)

Publication Number Publication Date
CN113723089A true CN113723089A (en) 2021-11-30
CN113723089B CN113723089B (en) 2023-12-26

Family

ID=78671712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010448100.6A Active CN113723089B (en) 2020-05-25 2020-05-25 Word segmentation model training method, word segmentation method and data processing method and device

Country Status (1)

Country Link
CN (1) CN113723089B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116719907A (en) * 2023-06-26 2023-09-08 阿波罗智联(北京)科技有限公司 Data processing method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016179987A1 (en) * 2015-05-12 2016-11-17 深圳市华傲数据技术有限公司 Chinese address parsing and annotation method
CN109271631A (en) * 2018-09-12 2019-01-25 广州多益网络股份有限公司 Segmenting method, device, equipment and storage medium
CN110223737A (en) * 2019-06-13 2019-09-10 电子科技大学 A kind of chemical composition of Chinese materia medica name entity recognition method and device
WO2019174423A1 (en) * 2018-03-16 2019-09-19 北京国双科技有限公司 Entity sentiment analysis method and related apparatus

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016179987A1 (en) * 2015-05-12 2016-11-17 深圳市华傲数据技术有限公司 Chinese address parsing and annotation method
WO2019174423A1 (en) * 2018-03-16 2019-09-19 北京国双科技有限公司 Entity sentiment analysis method and related apparatus
CN109271631A (en) * 2018-09-12 2019-01-25 广州多益网络股份有限公司 Segmenting method, device, equipment and storage medium
CN110223737A (en) * 2019-06-13 2019-09-10 电子科技大学 A kind of chemical composition of Chinese materia medica name entity recognition method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
邓丽萍; 罗智勇: "基于半监督CRF的跨领域中文分词" [Cross-domain Chinese Word Segmentation Based on Semi-supervised CRF], 中文信息学报 [Journal of Chinese Information Processing], no. 04 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116719907A (en) * 2023-06-26 2023-09-08 阿波罗智联(北京)科技有限公司 Data processing method, device, equipment and storage medium
CN116719907B (en) * 2023-06-26 2024-06-11 阿波罗智联(北京)科技有限公司 Data processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113723089B (en) 2023-12-26

Similar Documents

Publication Publication Date Title
US10372821B2 (en) Identification of reading order text segments with a probabilistic language model
CN110765996B (en) Text information processing method and device
CN110287480B (en) Named entity identification method, device, storage medium and terminal equipment
CN112016310A (en) Text error correction method, system, device and readable storage medium
CN111310447B (en) Grammar error correction method, grammar error correction device, electronic equipment and storage medium
CN107544726B (en) Speech recognition result error correction method and device based on artificial intelligence and storage medium
CN111695385B (en) Text recognition method, device and equipment
CN110222330B (en) Semantic recognition method and device, storage medium and computer equipment
CN110705302B (en) Named entity identification method, electronic equipment and computer storage medium
CN110795938A (en) Text sequence word segmentation method, device and storage medium
CN110457683B (en) Model optimization method and device, computer equipment and storage medium
CN111291566A (en) Event subject identification method and device and storage medium
US11763588B2 (en) Computing system for extraction of textual elements from a document
CN103268185A (en) Text display method and text display device for e-book reader
CN113657098B (en) Text error correction method, device, equipment and storage medium
US20160055146A1 (en) Document processing device, document processing method, program, and information storage medium
CN113704429A (en) Semi-supervised learning-based intention identification method, device, equipment and medium
CN111079432A (en) Text detection method and device, electronic equipment and storage medium
CN113723089A (en) Word segmentation model training method, word segmentation method, data processing method and data processing device
CN110705211A (en) Text key content marking method and device, computer equipment and storage medium
CN111753535A (en) Method and device for generating patent application text
CN110018827B (en) Method and device for automatically generating code, electronic equipment and readable storage medium
CN109166569B (en) Detection method and device for phoneme mislabeling
CN111145724B (en) Polyphone marking method and device and computer readable storage medium
CN115098722B (en) Text and image matching method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant