CN114065740A - Sentence sequence labeling method and device, electronic equipment and storage medium - Google Patents

Sentence sequence labeling method and device, electronic equipment and storage medium

Info

Publication number
CN114065740A
CN114065740A (application CN202111153477.XA)
Authority
CN
China
Prior art keywords
word
character
vector
vocabulary
collection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111153477.XA
Other languages
Chinese (zh)
Inventor
刘旭东
罗京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN202111153477.XA
Publication of CN114065740A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods


Abstract

The invention discloses a sentence sequence labeling method. The method includes: performing word segmentation processing on a target sentence through a preset domain dictionary to obtain a word segmentation set; for each character in the target sentence, acquiring, based on the word segmentation set, a vector of the character in each of N groups of vocabulary sets, and acquiring a word vector of the character according to the vector of the character in each group of vocabulary sets; and inputting the vector of each character in each group of vocabulary sets into a pre-trained sequence labeling model to perform sequence labeling on the target sentence. Segmenting the target sentence with the preset domain dictionary improves the accuracy of word segmentation; obtaining, from the word segmentation set, the vector of each character in each group of vocabulary sets and then the word vector of the character effectively enhances the representation information of each character in the target sentence. On the basis of more accurate word segmentation and enhanced character representations, the accuracy of the sequence labels output by the sequence labeling model can be improved.

Description

Sentence sequence labeling method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of language processing technologies, and in particular, to a sentence sequence labeling method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of artificial intelligence, machine learning and deep learning are applied in ever wider fields. Machine learning and deep learning enable a certain degree of understanding and reasoning, making machines more adaptive and more intelligent.
In the prior art, a sequence labeling framework based on machine learning and deep learning is usually used as the main framework of an end-to-end Point of Interest (POI) parsing model, and the framework abstracts POI parsing into a sequence labeling process. When the framework is used to perform sequence labeling on Chinese text, two modes are usually adopted. In the first mode, word segmentation processing is first performed on the POI text, and sequence labeling is performed with words as the labeling units. In the second mode, the POI word segmentation step is omitted, and sequence labeling is performed directly with characters (Chinese characters and English letters) as the labeling units. When sequence labeling is performed in the first mode, the existing word segmentation processing contains errors, and these errors propagate to the subsequent sequence labeling stage, so the accuracy of sequence labeling is low. When sequence labeling is performed in the second mode, prior vocabulary information is discarded and the model is expected to identify potential vocabulary information on its own, which increases the training difficulty of the sequence labeling model and may result in low prediction accuracy of the trained model. A method capable of improving the accuracy of sequence labeling is therefore urgently needed.
Disclosure of Invention
The embodiments of the invention provide a sentence sequence labeling method and device, an electronic device, and a storage medium, which can effectively improve the accuracy of sequence labeling.
A first aspect of the embodiments of the present invention provides a sentence sequence labeling method, where the method includes:
performing word segmentation processing on the target sentence through a preset domain dictionary to obtain a word segmentation set;
for each character in the target sentence, acquiring, based on the word segmentation set, a vector of the character in each of N groups of vocabulary sets, and acquiring a word vector of the character according to the vector of the character in each group of vocabulary sets, wherein N is an integer greater than 1;
and inputting the vector of each character in each group of vocabulary sets into a pre-trained sequence labeling model, and performing sequence labeling on the target sentence.
Optionally, the acquiring a word vector of the character according to the vector of the character in each group of vocabulary sets includes:
performing fusion processing on the vectors of the character in each group of vocabulary sets to obtain a vocabulary vector of the character;
and performing enhancement processing on a characterization vector of the character by using the vocabulary vector of the character to obtain the word vector of the character.
Optionally, the performing enhancement processing on the characterization vector of the character by using the vocabulary vector of the character to obtain the word vector of the character includes:
performing enhancement processing on the characterization vector of the character by using the vocabulary vector of the character in a preset enhancement mode to obtain the word vector of the character.
Optionally, the preset enhancement mode includes any one of a vector splicing mode and a vector mapping mode.
Optionally, the acquiring, for each character in the target sentence, a vector of the character in each of N groups of vocabulary sets based on the word segmentation set includes:
if the N groups of vocabulary sets include a first vocabulary set, a second vocabulary set, a third vocabulary set, and a fourth vocabulary set, acquiring, for each character and according to a correspondence between the character and each word in the word segmentation set, a first vector of the character in the first vocabulary set, a second vector in the second vocabulary set, a third vector in the third vocabulary set, and a fourth vector in the fourth vocabulary set, where the first vocabulary set represents the set of words in which the character is the initial character, the second vocabulary set represents the set of words in which the character is a middle character, the third vocabulary set represents the set of words in which the character is the end character, and the fourth vocabulary set represents the set of words in which the character is a single character.
Optionally, the acquiring, for each character and according to the correspondence between the character and each word in the word segmentation set, a first vector of the character in the first vocabulary set, a second vector in the second vocabulary set, a third vector in the third vocabulary set, and a fourth vector in the fourth vocabulary set includes:
for each character, determining a first matching result of the character and the first vocabulary set according to the correspondence, and acquiring the first vector according to the first matching result; determining a second matching result of the character and the second vocabulary set according to the correspondence, and acquiring the second vector according to the second matching result; determining a third matching result of the character and the third vocabulary set according to the correspondence, and acquiring the third vector according to the third matching result; and determining a fourth matching result of the character and the fourth vocabulary set according to the correspondence, and acquiring the fourth vector according to the fourth matching result.
Optionally, the training step of the sequence labeling model includes:
acquiring a training sample set, wherein each training sample in the training sample set comprises a training sentence;
for each training sample in the training sample set, performing word segmentation processing on the training sample through the preset domain dictionary to obtain a training word segmentation set;
for each character in each training sample, acquiring a training vector of the character in each of the N groups of vocabulary sets based on the training word segmentation set, and acquiring a training word vector of the character according to the training vector of the character in each group of vocabulary sets, wherein N is an integer greater than 1;
and performing model training by using the training word vector of each character in each training sample to obtain the sequence labeling model.
A second aspect of the embodiments of the present invention further provides a sentence sequence labeling device, where the device includes:
the word segmentation unit is used for performing word segmentation processing on the target sentence through a preset domain dictionary to obtain a word segmentation set;
the word vector determining unit is used for acquiring, for each character in the target sentence and based on the word segmentation set, a vector of the character in each of N groups of vocabulary sets, and acquiring a word vector of the character according to the vector of the character in each group of vocabulary sets, wherein N is an integer greater than 1;
and the sequence labeling unit is used for inputting the vector of each character in each group of vocabulary sets into a pre-trained sequence labeling model and performing sequence labeling on the target sentence.
Optionally, the word vector determining unit is configured to perform fusion processing on vectors of the characters in each group of vocabulary sets to obtain vocabulary vectors of the characters; and enhancing the characterization vectors of the characters by utilizing the vocabulary vectors of the characters to obtain word vectors of the characters.
Optionally, the word vector determining unit is configured to perform enhancement processing on the token vector of the character by using the vocabulary vector of the character in a preset enhancement mode, so as to obtain the word vector of the character.
Optionally, the word vector determining unit is configured to, if the N groups of vocabulary sets include a first vocabulary set, a second vocabulary set, a third vocabulary set, and a fourth vocabulary set, obtain, for each character and according to a correspondence between the character and each word in the word segmentation set, a first vector of the character in the first vocabulary set, a second vector in the second vocabulary set, a third vector in the third vocabulary set, and a fourth vector in the fourth vocabulary set, where the first vocabulary set represents the set of words in which the character is the initial character, the second vocabulary set represents the set of words in which the character is a middle character, the third vocabulary set represents the set of words in which the character is the end character, and the fourth vocabulary set represents the set of words in which the character is a single character.
A third aspect of the embodiments of the present invention provides an electronic device, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include operation instructions for performing the sentence sequence labeling method according to the first aspect.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the sentence sequence labeling method according to the first aspect.
One or more of the above technical solutions in the embodiments of the present invention have at least the following technical effects:
Based on the above technical solution, word segmentation processing is performed on the target sentence through a preset domain dictionary to obtain a word segmentation set; for each character in the target sentence, a vector of the character in each of N groups of vocabulary sets is obtained based on the word segmentation set, and a word vector of the character is obtained according to the vector of the character in each group of vocabulary sets; the vector of each character in each group of vocabulary sets is then input into a pre-trained sequence labeling model to perform sequence labeling on the target sentence. Performing word segmentation on the target sentence through the preset domain dictionary improves the accuracy of word segmentation, and obtaining the vector of each character in each group of vocabulary sets from the word segmentation set and then the word vector of the character effectively enhances the representation information of each character in the target sentence. On the basis of more accurate word segmentation and enhanced character representations, the accuracy of the sequence labels output by the sequence labeling model can be improved, so the accuracy of sequence labeling is effectively improved, which in turn improves the accuracy of subsequent processing such as semantic recognition.
Drawings
Fig. 1 is a schematic flowchart of a sentence sequence labeling method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a training method for a sequence labeling model according to an embodiment of the present invention;
Fig. 3 is a block diagram of a sentence sequence labeling device according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following describes in detail, with reference to the accompanying drawings, the main implementation principle, specific implementation manners, and corresponding beneficial effects of the technical solutions of the embodiments of the present invention.
Embodiments
Referring to fig. 1, an embodiment of the present invention provides a sentence sequence labeling method, where the method includes:
S101, performing word segmentation processing on a target sentence through a preset domain dictionary to obtain a word segmentation set;
S102, for each character in the target sentence, acquiring, based on the word segmentation set, a vector of the character in each of N groups of vocabulary sets corresponding to the word segmentation set, and acquiring a word vector of the character according to the vector of the character in each group of vocabulary sets, where N is an integer greater than 1;
S103, inputting the vector of each character in each group of vocabulary sets into a pre-trained sequence labeling model, and performing sequence labeling on the target sentence.
In step S101, the domain dictionaries need to be preprocessed. The preprocessing may include collecting, screening, and cleaning target dictionaries to obtain dictionaries of various domains, and the obtained domain dictionaries are used as the preset domain dictionary. In this way, when step S101 is executed, the target sentence may be segmented with each dictionary in the preset domain dictionary to obtain the word segmentation set; in this case the word segmentation accuracy is highest. Alternatively, the target domain to which the target sentence belongs may be determined first, a target domain dictionary corresponding to the target domain may be obtained from the preset domain dictionary, and the target sentence may then be segmented with the target domain dictionary to obtain the word segmentation set.
Specifically, when the target sentence is segmented with each dictionary in the preset domain dictionary, different characters in the target sentence can be segmented by matching them against the corresponding domain dictionary, which improves the accuracy of word segmentation.
In the embodiments of this specification, the preset domain dictionary may include a brand dictionary, a place name dictionary, a road name dictionary, a business district dictionary, and the like.
For example, take the target sentence "Old Beijing Roast Duck Wudaokou Shop" (老北京烤鸭五道口店). If the preset domain dictionary includes a place name dictionary, a brand dictionary, and a business district dictionary, the target sentence is first segmented with the place name dictionary to obtain the place name words "Beijing" (北京) and "Wudaokou" (五道口); the target sentence is segmented with the brand dictionary to obtain the brand words "Beijing Roast Duck" (北京烤鸭) and "Old" (老); and the target sentence is segmented with the business district dictionary to obtain the business district word "Wudaokou" (五道口). The finally obtained word segmentation set is {Old, Beijing, Beijing Roast Duck, Wudaokou}. In this way, the target sentence is segmented with each dictionary in the preset domain dictionary, so the accuracy of the obtained word segmentation set is higher.
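For illustration only, the multi-dictionary matching described above can be sketched as follows; the dictionary contents and the exhaustive substring scan are assumptions made for this example rather than the implementation of this application.

# Illustrative sketch only: collect every substring of the target sentence that
# appears in any of several assumed domain dictionaries.
def build_segmentation_set(sentence, domain_dicts):
    segments = set()
    n = len(sentence)
    for vocab in domain_dicts.values():
        for i in range(n):
            for j in range(i + 1, n + 1):
                if sentence[i:j] in vocab:  # this substring is a word of the domain dictionary
                    segments.add(sentence[i:j])
    return segments

# Assumed dictionary contents corresponding to the example above.
dicts = {
    "place_name": {"北京", "五道口"},          # Beijing, Wudaokou
    "brand": {"老", "北京烤鸭"},               # Old, Beijing Roast Duck
    "business_district": {"五道口"},           # Wudaokou
}
print(build_segmentation_set("老北京烤鸭五道口店", dicts))
# expected: {'老', '北京', '北京烤鸭', '五道口'}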
After the word segmentation set is obtained, step S102 is performed.
In step S102, N groups of vocabulary sets are preset. The N groups of vocabulary sets may include at least one of a first vocabulary set, a second vocabulary set, a third vocabulary set, and a fourth vocabulary set, where the first vocabulary set represents the set of words in which the character is the initial character, the second vocabulary set represents the set of words in which the character is a middle character, the third vocabulary set represents the set of words in which the character is the end character, and the fourth vocabulary set represents the set of words in which the character is a single character. Preferably, the N groups of vocabulary sets include the first vocabulary set, the second vocabulary set, the third vocabulary set, and the fourth vocabulary set.
In the embodiments of this specification, N is an integer greater than 1, for example, 2, 4, 5, or 6.
Of course, the N groups of vocabulary sets may further include at least one of a fifth vocabulary set, a sixth vocabulary set, and a seventh vocabulary set, where the fifth vocabulary set represents the set of two-character combinations, the sixth vocabulary set represents the set of three-character combinations, and the seventh vocabulary set represents the set of four-character combinations. In the following, the case where the N groups of vocabulary sets include the first vocabulary set, the second vocabulary set, the third vocabulary set, and the fourth vocabulary set is taken as the specific example.
Specifically, after the vector of the character in each group of vocabulary sets is obtained, the vectors of the character in each group of vocabulary sets can be fused to obtain the vocabulary vector of the character, and the characterization vector of the character can then be enhanced with the vocabulary vector of the character to obtain the word vector of the character. Because the word vector of the character is obtained by enhancing the characterization vector of the character with the vocabulary vector of the character, the word vector contains more information, which enriches the representation information of the character. Of course, the vectors of the character in each group of vocabulary sets may also be used directly as the vocabulary vector of the character; this specification is not particularly limited in this respect.
In the embodiments of this specification, when the vectors of the character in each group of vocabulary sets are fused, the vectors of the character in each group of vocabulary sets may be spliced, and the spliced vector is used as the vocabulary vector of the character; alternatively, the vectors of the character in each group of vocabulary sets can be input into a simple neural network for feature extraction, and the extracted vector is used as the vocabulary vector of the character.
Specifically, when the characterization vector of the character is enhanced with the vocabulary vector of the character, the enhancement may be performed in a preset enhancement mode to obtain the word vector of the character. The preset enhancement mode includes any one of a vector splicing mode, a vector mapping mode, and the like.
Further, after the vocabulary vector of the character is obtained, the characterization vector of the character may be obtained through a word vector model, for example, a Word2Vec model. After the vocabulary vector and the characterization vector of the character are obtained, they may be spliced and the spliced vector used as the word vector of the character; alternatively, the vocabulary vector and the characterization vector of the character may be mapped through an embedding layer and the mapped vector used as the word vector of the character. This specification is not limited in this respect.
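For illustration only, the splicing and mapping variants described above can be sketched as follows; the vector dimensions, the random placeholder vectors, and the linear layer standing in for the embedding-layer mapping are assumptions.

import torch
import torch.nn as nn

# Sketch: fuse the four per-set vectors of one character by splicing, then enhance
# the character's characterization vector either by splicing or by a learned mapping.
set_dim, char_dim = 32, 64
first_v, second_v, third_v, fourth_v = (torch.randn(set_dim) for _ in range(4))
char_repr = torch.randn(char_dim)                        # stand-in for a Word2Vec-style characterization vector

vocab_vec = torch.cat([first_v, second_v, third_v, fourth_v])     # fusion by splicing

# (a) vector splicing mode: word vector = [characterization vector ; vocabulary vector]
word_vec_spliced = torch.cat([char_repr, vocab_vec])

# (b) vector mapping mode: map the combined vector back to a fixed size
mapping = nn.Linear(char_dim + 4 * set_dim, char_dim)
word_vec_mapped = mapping(torch.cat([char_repr, vocab_vec]))
print(word_vec_spliced.shape, word_vec_mapped.shape)     # torch.Size([192]) torch.Size([64])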
Specifically, when the N groups of vocabulary sets include the first vocabulary set, the second vocabulary set, the third vocabulary set, and the fourth vocabulary set, for each character, a first vector of the character in the first vocabulary set, a second vector in the second vocabulary set, a third vector in the third vocabulary set, and a fourth vector in the fourth vocabulary set may be obtained according to the correspondence between the character and each word in the word segmentation set. The first vocabulary set represents the set of words in which the character is the initial character, the second vocabulary set represents the set of words in which the character is a middle character, the third vocabulary set represents the set of words in which the character is the end character, and the fourth vocabulary set represents the set of words in which the character is a single character. The first, second, third, and fourth vocabulary sets may be denoted by B, M, E, and S, respectively.
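For illustration only, the correspondence between each character and the four vocabulary sets B, M, E, and S can be computed as in the following sketch; the word segmentation set is taken from the example above, and the matching logic is an assumption rather than the claimed implementation.

# Sketch: for each character position, collect the words of the segmentation set in which
# the character is the initial (B), a middle (M), the end (E) or a single (S) character.
def bmes_match(sentence, segments):
    results = []
    for i, ch in enumerate(sentence):
        match = {"B": set(), "M": set(), "E": set(), "S": set()}
        for w in segments:
            if len(w) == 1:
                if w == ch:
                    match["S"].add(w)
                continue
            start = sentence.find(w)
            while start != -1:                 # every occurrence of the word that covers position i
                end = start + len(w) - 1
                if start == i:
                    match["B"].add(w)
                elif end == i:
                    match["E"].add(w)
                elif start < i < end:
                    match["M"].add(w)
                start = sentence.find(w, start + 1)
        results.append((ch, match))
    return results

segments = {"老", "北京", "北京烤鸭", "五道口"}
for ch, m in bmes_match("老北京烤鸭五道口店", segments)[:3]:
    print(ch, m)
# expected: 老 -> S = {老}; 北 -> B = {北京, 北京烤鸭}; 京 -> M = {北京烤鸭}, E = {北京}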
Specifically, when the first vector of the character in the first vocabulary set is obtained, a first matching result of the character and the first vocabulary set can be determined according to the correspondence, and the first vector is then obtained according to the first matching result; a second matching result of the character and the second vocabulary set is determined according to the correspondence, and the second vector is obtained according to the second matching result; a third matching result of the character and the third vocabulary set is determined according to the correspondence, and the third vector is obtained according to the third matching result; and a fourth matching result of the character and the fourth vocabulary set is determined according to the correspondence, and the fourth vector is obtained according to the fourth matching result. The first, second, third, and fourth vectors have the same vector length, which facilitates subsequent vector operations and improves computational efficiency. Of course, the vector lengths of the first, second, third, and fourth vectors may also differ; in that case, the first, second, third, and fourth vectors may be processed by a neural network to obtain vectors of a set length as the vocabulary vector of the character.
When the first vector is obtained according to the first matching result, the first matching result can be input into a simple neural network to obtain a vector of a set length as the first vector; similarly, the second matching result can be input into a simple neural network to obtain a vector of the set length as the second vector; the third matching result can be input into a simple neural network to obtain a vector of the set length as the third vector; and the fourth matching result can be input into a simple neural network to obtain a vector of the set length as the fourth vector. The set length can be set manually, by the device, or according to actual requirements, and may be, for example, 16, 32, or 64.
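For illustration only, one possible form of such a simple neural network, which maps a matching result to a vector of a set length, is sketched below; the mean pooling, the embedding size, and the set length of 32 are assumptions.

import torch
import torch.nn as nn

# Sketch: turn one matching result -- the set of words matched for a character in one
# of the B/M/E/S vocabulary sets -- into a vector of a set length.
class MatchEncoder(nn.Module):
    def __init__(self, vocab, word_dim=50, set_length=32):
        super().__init__()
        self.index = {w: i for i, w in enumerate(vocab)}
        self.emb = nn.Embedding(len(vocab), word_dim)
        self.proj = nn.Linear(word_dim, set_length)     # the "simple neural network"

    def forward(self, matched_words):
        if not matched_words:                           # empty matching result ("None" in Table 1)
            return torch.zeros(self.proj.out_features)
        ids = torch.tensor([self.index[w] for w in matched_words])
        return self.proj(self.emb(ids).mean(dim=0))     # pool the matched words, then project

encoder = MatchEncoder(vocab=["老", "北京", "北京烤鸭", "五道口"])
first_vector = encoder({"北京", "北京烤鸭"})            # e.g. the B-set matching result of the character 北
print(first_vector.shape)                               # torch.Size([32])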
For example, taking the target word as "old beijing duck five-crossing shop" as an example, the word segmentation set obtained by segmenting the target word is { old, beijing duck, five-crossing }, so that for each character in each character set, based on the word segmentation set, the matching result of each character in B, M, E, S can be obtained first as shown in table 1 below:
TABLE 1 (rendered as images in the original publication; it lists, for each character of the target sentence, the words matched in each of the B, M, E, and S vocabulary sets)
In Table 1, for the character "old" (老) in the target sentence, only the brand "Old" (老) is matched in the word segmentation set, and this brand has only one character, so the matching results of the character "old" against the four vocabulary sets B, M, E, and S are shown in the first row of Table 1. For the character "north" (北), the word segmentation set matches "Beijing" (北京, place name) and "Beijing Roast Duck" (北京烤鸭, brand), both beginning with "north", so the matching results of the character against the four vocabulary sets are shown in the second row of Table 1. For the character "jing" (京), the word segmentation set matches "Beijing Roast Duck" (brand), in which "jing" is a middle character, and "Beijing" (place name), in which "jing" is the end character, so the matching results of the character against the four vocabulary sets are shown in the third row of Table 1. The other characters in the target sentence are matched in the same way to obtain the matching result of each character.
Further, for the character "old", the 4 matching results shown in the first row of Table 1 can be input into the neural network to obtain 4 vectors of the character "old"; the 4 vectors are then spliced with the characterization vector of the character "old", and the spliced vector is used as the word vector of the character.
After the word vector of the character is acquired, step S103 is executed.
Before step S103 is executed, the sequence labeling model needs to be trained in advance. Referring to fig. 2, the training of the sequence labeling model includes:
s201, obtaining a training sample set, wherein each training sample in the training sample set comprises a training sentence;
specifically, the training sample set may be extracted from a corpus of each domain, and the extracted sentences of each domain may be combined into the training sample set.
S202, for each training sample in the training sample set, performing word segmentation processing on the training sample through the preset domain dictionary to obtain a training word segmentation set;
for the specific implementation of step S202, reference may be made to the description of step S101, and for brevity of the description, the description is not repeated here.
S203, for each character in each training sample, acquiring a training vector of the character in each of the N groups of vocabulary sets based on the training word segmentation set, and acquiring a training word vector of the character according to the training vector of the character in each group of vocabulary sets, wherein N is an integer greater than 1;
for the specific implementation of step S203, reference may be made to the description of step S102, and for brevity of the description, the description is not repeated here.
And S204, performing model training by using the training word vector of each character in each training sample to obtain the sequence labeling model.
Specifically, a constraint condition may first be set, for example, that the matching degree between the output sequence labels and the manual sequence labels corresponding to the training samples is not less than a set matching degree. Model training is then performed with the training word vector of each character in each training sample until a model satisfying the constraint is obtained as the sequence labeling model, where the set matching degree is, for example, 95%, 96%, 98%, or 99%.
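For illustration only, a training loop governed by such a matching-degree constraint is sketched below; the synthetic training data, the linear tagger standing in for the sequence labeling model, and the 95% threshold are assumptions.

import torch
import torch.nn as nn

# Sketch: train a toy per-character tagger on (word vector, label) pairs until the
# matching degree with the manual labels reaches the set matching degree.
torch.manual_seed(0)
num_chars, word_dim, num_tags, set_matching_degree = 500, 96, 5, 0.95
word_vectors = torch.randn(num_chars, word_dim)              # training word vectors of the characters
true_proj = torch.randn(word_dim, num_tags)
gold_labels = (word_vectors @ true_proj).argmax(dim=-1)      # synthetic "manual" sequence labels

model = nn.Linear(word_dim, num_tags)                        # stand-in for the sequence labeling model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(1000):
    optimizer.zero_grad()
    logits = model(word_vectors)
    loss_fn(logits, gold_labels).backward()
    optimizer.step()
    matching_degree = (logits.argmax(dim=-1) == gold_labels).float().mean().item()
    if matching_degree >= set_matching_degree:               # constraint condition satisfied
        break
print(f"stopped at epoch {epoch} with matching degree {matching_degree:.2%}")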
Thus, after the sequence labeling model is obtained by training, step S103 is performed. At this time, the vector of each character in each group of vocabulary sets acquired in step S102 is input into the sequence labeling model, and the target sentence is sequence-labeled.
Therefore, when the target sentence is segmented, the segmentation is performed through the preset domain dictionary; because the preset domain dictionary includes dictionaries of multiple domains and the target sentence is matched against each domain dictionary, the word segmentation accuracy is higher. On the one hand, dictionaries of different domains carry different semantic information; on the other hand, the division into the four sets B, M, E, and S carries the boundary information of the words. The domain semantic information and the boundary information of the words can therefore be fused into the word vector of each character, so that prior information such as the semantics and boundaries of domain words is exploited, while both the error propagation caused by pre-segmentation and the loss of vocabulary information caused by using character information alone are avoided. For example, after the target sentence "Old Beijing Roast Duck Wudaokou Shop" is processed through the above steps, the representation of the character "jing" (京) contains the semantic information of the place name "Beijing", in which the character is located at the end of the word boundary, and the brand information of "Beijing Roast Duck", in which the character is located in the middle of the word boundary.
Based on the above technical solution, word segmentation processing is performed on the target sentence through a preset domain dictionary to obtain a word segmentation set; for each character in the target sentence, a vector of the character in each of N groups of vocabulary sets is obtained based on the word segmentation set, and a word vector of the character is obtained according to the vector of the character in each group of vocabulary sets; the vector of each character in each group of vocabulary sets is then input into a pre-trained sequence labeling model to perform sequence labeling on the target sentence. Performing word segmentation on the target sentence through the preset domain dictionary improves the accuracy of word segmentation, and obtaining the vector of each character in each group of vocabulary sets from the word segmentation set and then the word vector of the character effectively enhances the representation information of each character in the target sentence. On the basis of more accurate word segmentation and enhanced character representations, the accuracy of the sequence labels output by the sequence labeling model can be improved, so the accuracy of sequence labeling is effectively improved, which in turn improves the accuracy of subsequent processing such as semantic recognition.
The above embodiments describe a sentence sequence labeling method. An embodiment of the present invention further provides a sentence sequence labeling device. Referring to fig. 3, the device includes:
a word segmentation unit 301, configured to perform word segmentation processing on the target sentence through a preset domain dictionary to obtain a word segmentation set;
a word vector determining unit 302, configured to, for each character in the target sentence, obtain, based on the word segmentation set, a vector of the character in each of N groups of word sets, and obtain, according to the vector of the character in each group of word sets, a word vector of the character, where N is an integer greater than 1;
and a sequence labeling unit 303, configured to input the vector of each character in each group of vocabulary sets into a pre-trained sequence labeling model, and perform sequence labeling on the target sentence.
In an optional implementation manner, the word vector determining unit 302 is configured to perform fusion processing on vectors of the characters in each group of vocabulary sets to obtain vocabulary vectors of the characters; and enhancing the characterization vectors of the characters by utilizing the vocabulary vectors of the characters to obtain word vectors of the characters.
In an optional implementation manner, the word vector determining unit 302 is configured to perform enhancement processing on the token vector of the character by using the vocabulary vector of the character in a preset enhancement manner, so as to obtain a word vector of the character.
In an optional implementation manner, the preset enhancement manner includes any one of a vector splicing manner and a vector mapping manner.
In an optional implementation manner, the word vector determining unit 302 is configured to, if the N groups of vocabulary sets include a first vocabulary set, a second vocabulary set, a third vocabulary set, and a fourth vocabulary set, obtain, for each character and according to a correspondence between the character and each word in the word segmentation set, a first vector of the character in the first vocabulary set, a second vector in the second vocabulary set, a third vector in the third vocabulary set, and a fourth vector in the fourth vocabulary set, where the first vocabulary set represents the set of words in which the character is the initial character, the second vocabulary set represents the set of words in which the character is a middle character, the third vocabulary set represents the set of words in which the character is the end character, and the fourth vocabulary set represents the set of words in which the character is a single character.
In an optional implementation manner, the word vector determining unit 302 is configured to, for each character, determine a first matching result of the character and the first vocabulary set according to the correspondence, and then obtain the first vector according to the first matching result; determine a second matching result of the character and the second vocabulary set according to the correspondence, and obtain the second vector according to the second matching result; determine a third matching result of the character and the third vocabulary set according to the correspondence, and obtain the third vector according to the third matching result; and determine a fourth matching result of the character and the fourth vocabulary set according to the correspondence, and obtain the fourth vector according to the fourth matching result.
In an optional implementation manner, the device further includes:
a model training unit, configured to acquire a training sample set, where each training sample in the training sample set includes a training sentence; for each training sample in the training sample set, perform word segmentation processing on the training sample through the preset domain dictionary to obtain a training word segmentation set; for each character in each training sample, acquire a training vector of the character in each of the N groups of vocabulary sets based on the training word segmentation set, and acquire a training word vector of the character according to the training vector of the character in each group of vocabulary sets, where N is an integer greater than 1; and perform model training by using the training word vector of each character in each training sample to obtain the sequence labeling model.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
FIG. 4 is a block diagram illustrating an electronic device 800 for a sentence sequence labeling method in accordance with an exemplary embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 4, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing elements 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the electronic device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium having instructions therein which, when executed by a processor of a mobile terminal, enable the mobile terminal to perform a sentence sequence labeling method, the method comprising:
performing word segmentation processing on the target sentence through a preset domain dictionary to obtain a word segmentation set;
for each character in the target sentence, acquiring, based on the word segmentation set, a vector of the character in each of N groups of vocabulary sets, and acquiring a word vector of the character according to the vector of the character in each group of vocabulary sets, wherein N is an integer greater than 1;
and inputting the vector of each character in each group of vocabulary sets into a pre-trained sequence labeling model, and performing sequence labeling on the target sentence.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is only limited by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (13)

1. A sentence sequence labeling method, the method comprising:
performing word segmentation processing on the target sentence through a preset domain dictionary to obtain a word segmentation set;
for each character in the target sentence, acquiring, based on the word segmentation set, a vector of the character in each of N groups of vocabulary sets, and acquiring a word vector of the character according to the vector of the character in each group of vocabulary sets, wherein N is an integer greater than 1;
and inputting the vector of each character in each group of vocabulary sets into a pre-trained sequence labeling model, and performing sequence labeling on the target sentence.
2. The method of claim 1, wherein the acquiring a word vector of the character according to the vector of the character in each group of vocabulary sets comprises:
performing fusion processing on the vectors of the character in each group of vocabulary sets to obtain a vocabulary vector of the character;
and performing enhancement processing on a characterization vector of the character by using the vocabulary vector of the character to obtain the word vector of the character.
3. The method of claim 2, wherein the performing enhancement processing on the characterization vector of the character by using the vocabulary vector of the character to obtain the word vector of the character comprises:
performing enhancement processing on the characterization vector of the character by using the vocabulary vector of the character in a preset enhancement mode to obtain the word vector of the character.
4. The method of claim 3, wherein the predetermined enhancement mode comprises any one of a vector splicing mode and a vector mapping mode.
5. The method of any one of claims 1 to 4, wherein the acquiring, for each character in the target sentence, a vector of the character in each of N groups of vocabulary sets based on the word segmentation set comprises:
if the N groups of vocabulary sets include a first vocabulary set, a second vocabulary set, a third vocabulary set, and a fourth vocabulary set, acquiring, for each character and according to a correspondence between the character and each word in the word segmentation set, a first vector of the character in the first vocabulary set, a second vector in the second vocabulary set, a third vector in the third vocabulary set, and a fourth vector in the fourth vocabulary set, wherein the first vocabulary set represents the set of words in which the character is the initial character, the second vocabulary set represents the set of words in which the character is a middle character, the third vocabulary set represents the set of words in which the character is the end character, and the fourth vocabulary set represents the set of words in which the character is a single character.
6. The method of claim 5, wherein the acquiring, for each character and according to the correspondence between the character and each word in the word segmentation set, a first vector of the character in the first vocabulary set, a second vector in the second vocabulary set, a third vector in the third vocabulary set, and a fourth vector in the fourth vocabulary set comprises:
for each character, determining a first matching result of the character and the first vocabulary set according to the correspondence, and acquiring the first vector according to the first matching result; determining a second matching result of the character and the second vocabulary set according to the correspondence, and acquiring the second vector according to the second matching result; determining a third matching result of the character and the third vocabulary set according to the correspondence, and acquiring the third vector according to the third matching result; and determining a fourth matching result of the character and the fourth vocabulary set according to the correspondence, and acquiring the fourth vector according to the fourth matching result.
7. The method of claim 1, wherein the training step of the sequence annotation model comprises:
acquiring a training sample set, wherein each training sample in the training sample set comprises a training sentence;
for each training sample in the training sample set, performing word segmentation processing on the training sample through the preset domain dictionary to obtain a training word segmentation set;
for each character in each training sample, acquiring a training vector of the character in each of the N groups of vocabulary sets based on the training word segmentation set, and acquiring a training word vector of the character according to the training vector of the character in each group of vocabulary sets, wherein N is an integer greater than 1;
and performing model training by using the training word vector of each character in each training sample to obtain the sequence labeling model.
8. An apparatus for labeling a sequence of sentences, the apparatus comprising:
the word segmentation unit is used for performing word segmentation processing on the target sentence through a preset domain dictionary to obtain a word segmentation set;
a word vector determining unit, configured to, for each character in the target sentence, obtain, based on the word segmentation set, a vector of the character in each of N groups of vocabulary sets, and obtain, according to the vector of the character in each group of vocabulary sets, a word vector of the character, wherein N is an integer greater than 1;
and the sequence labeling unit is used for inputting the vector of each character in each group of vocabulary set into a pre-trained sequence labeling model and performing sequence labeling on the target sentence.
9. The apparatus according to claim 8, wherein the word vector determining unit is configured to perform a fusion process on vectors of the characters in each group of vocabulary sets to obtain vocabulary vectors of the characters; and enhancing the characterization vectors of the characters by utilizing the vocabulary vectors of the characters to obtain word vectors of the characters.
10. The apparatus of claim 9, wherein the word vector determining unit is configured to perform enhancement processing on the token vector of the character by using the vocabulary vector of the character in a preset enhancement mode to obtain the word vector of the character.
11. The apparatus of claim 10, wherein the word vector determination unit is configured to, if the N groups of vocabulary sets include a first vocabulary set, a second vocabulary set, a third vocabulary set, and a fourth vocabulary set, obtain, for each character and based on a correspondence between the character and each word in the word segmentation set, a first vector of the character in the first vocabulary set, a second vector in the second vocabulary set, a third vector in the third vocabulary set, and a fourth vector in the fourth vocabulary set, wherein the first vocabulary set represents the set of words in which the character is the initial character, the second vocabulary set represents the set of words in which the character is a middle character, the third vocabulary set represents the set of words in which the character is the end character, and the fourth vocabulary set represents the set of words in which the character is a single character.
12. An electronic device, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include operation instructions for performing the method according to any one of claims 1 to 7.
13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps corresponding to the method according to any one of claims 1 to 7.
CN202111153477.XA 2021-09-29 2021-09-29 Sentence sequence labeling method and device, electronic equipment and storage medium Pending CN114065740A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111153477.XA CN114065740A (en) 2021-09-29 2021-09-29 Sentence sequence labeling method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114065740A (en) 2022-02-18

Family

ID=80233857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111153477.XA Pending CN114065740A (en) 2021-09-29 2021-09-29 Sentence sequence labeling method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114065740A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111343203A (en) * 2020-05-18 2020-06-26 国网电子商务有限公司 Sample recognition model training method, malicious sample extraction method and device
CN111737999A (en) * 2020-06-24 2020-10-02 深圳前海微众银行股份有限公司 Sequence labeling method, device and equipment and readable storage medium
CN111950256A (en) * 2020-06-23 2020-11-17 北京百度网讯科技有限公司 Sentence break processing method and device, electronic equipment and computer storage medium
CN112214994A (en) * 2020-10-10 2021-01-12 苏州大学 Word segmentation method, device and equipment based on multi-level dictionary and readable storage medium
CN113408287A (en) * 2021-06-23 2021-09-17 北京达佳互联信息技术有限公司 Entity identification method and device, electronic equipment and storage medium
CN113408268A (en) * 2021-06-22 2021-09-17 平安科技(深圳)有限公司 Slot filling method, device, equipment and storage medium


Similar Documents

Publication Publication Date Title
CN110580290B (en) Method and device for optimizing training set for text classification
CN111368541B (en) Named entity identification method and device
CN107564526B (en) Processing method, apparatus and machine-readable medium
CN108304412B (en) Cross-language search method and device for cross-language search
CN109471919B (en) Zero pronoun resolution method and device
CN112528671A (en) Semantic analysis method, semantic analysis device and storage medium
CN112735396A (en) Speech recognition error correction method, device and storage medium
CN111640452B (en) Data processing method and device for data processing
CN113343720A (en) Subtitle translation method and device for subtitle translation
CN112035651B (en) Sentence completion method, sentence completion device and computer readable storage medium
CN111723606A (en) Data processing method and device and data processing device
CN113920293A (en) Information identification method and device, electronic equipment and storage medium
US11461561B2 (en) Method and device for information processing, and storage medium
CN111832297A (en) Part-of-speech tagging method and device and computer-readable storage medium
CN111414766B (en) Translation method and device
CN111324214A (en) Statement error correction method and device
CN114065740A (en) Sentence sequence labeling method and device, electronic equipment and storage medium
CN110968246A (en) Intelligent Chinese handwriting input recognition method and device
CN114550691A (en) Multi-tone word disambiguation method and device, electronic equipment and readable storage medium
CN111414731B (en) Text labeling method and device
CN112579767A (en) Search processing method and device for search processing
CN110334338B (en) Word segmentation method, device and equipment
CN110931013B (en) Voice data processing method and device
CN112861531B (en) Word segmentation method, device, storage medium and electronic equipment
WO2022105229A1 (en) Input method and apparatus, and apparatus for inputting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination