WO2022073333A1 - Multi-level dictionary-based word segmentation method, apparatus and device, and readable storage medium - Google Patents

Multi-level dictionary-based word segmentation method, apparatus and device, and readable storage medium Download PDF

Info

Publication number
WO2022073333A1
Authority
WO
WIPO (PCT)
Prior art keywords
character
word
feature
representation
vector
Prior art date
Application number
PCT/CN2021/088599
Other languages
English (en)
French (fr)
Inventor
李正华
周厚全
侯洋
周仕林
张民
Original Assignee
苏州大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州大学 filed Critical 苏州大学
Publication of WO2022073333A1 publication Critical patent/WO2022073333A1/zh

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Definitions

  • The present application relates to the field of computer technology, and in particular to a word segmentation method, apparatus and device based on a multi-level dictionary, and a readable storage medium.
  • Chinese word segmentation is the process of dividing an input sentence into a sequence of words. An additional dictionary is usually provided to the model to alleviate the shortage of manually annotated training data.
  • However, current word segmentation schemes all use single-level dictionaries, ignoring the fact that different words in a dictionary have different word-formation probabilities, and that the same string may form a word in one domain but not in another, which results in poor segmentation performance.
  • Moreover, segmentation methods based on a single-level dictionary often have little effect on actual segmentation quality.
  • The main reason is that dictionary knowledge is added to the segmentation model as a soft constraint in the form of features, and the word-formation probabilities of dictionary entries vary widely, so the influence on the model is limited.
  • The purpose of this application is to provide a word segmentation method, apparatus and device based on a multi-level dictionary, and a readable storage medium, to solve the problem that current segmentation models all use a single-level dictionary, resulting in poor segmentation performance. The specific solution is as follows:
  • The present application provides a word segmentation method based on a multi-level dictionary, including:
  • for a target sentence, generating a vector representation of each character, and generating a feature representation of each character in at least two dictionaries;
  • using a machine-learning-based word segmentation model to determine the word-formation label of each character according to the vector representation and the feature representation, where the word-formation labels include: the current character is the first character of a word, the current character is the last character of a word, the current character is in the middle of a word, and the current character forms a word by itself;
  • segmenting the target sentence according to the word-formation label of each character.
  • Optionally, before generating a vector representation of each character for the target sentence and generating a feature representation of each character in at least two dictionaries, the method further includes:
  • dividing a target dictionary into at least two dictionaries according to domain and/or word-formation probability.
  • Optionally, generating a vector representation of each character for the target sentence includes:
  • for the target sentence, generating character n-gram features, character reduplication features and character category features of each character as that character's vector representation;
  • correspondingly, determining the word-formation label of each character according to the vector representation and the feature representation includes:
  • using a word segmentation model based on traditional discrete features to determine the word-formation label of each character according to the vector representation and the feature representation.
  • Optionally, when the model based on traditional discrete features is a CRF model, generating the feature representations of each character in at least two dictionaries includes:
  • generating unigram and trigram features of each character in at least two dictionaries as the feature representations.
  • Optionally, generating a vector representation of each character for the target sentence includes:
  • randomly generating an embedding vector table, and determining the vector representation of each character by querying the embedding vector table;
  • correspondingly, determining the word-formation label of each character according to the vector representation and the feature representation includes:
  • using a neural-network-based word segmentation model
  • to perform feature extraction on the vector representation and the feature representation, obtaining a feature vector of each character, and determining the word-formation label of each character according to the feature vector.
  • Optionally, when the neural-network-based model is a BiLSTM-CRF model, generating the feature representations of each character in at least two dictionaries includes:
  • generating 2-gram, 3-gram, 4-gram and 5-gram features of each character in at least two dictionaries as the feature representations.
  • Optionally, performing feature extraction on the vector representation and the feature representation using the neural-network-based word segmentation model to obtain a feature vector of each character includes:
  • concatenating the vector representation and the feature representation, and performing feature extraction on the concatenation result to obtain the feature vector of each character.
  • The application provides a word segmentation apparatus based on a multi-level dictionary, including:
  • a representation module, configured to generate, for a target sentence, a vector representation of each character, and to generate a feature representation of each character in at least two dictionaries;
  • a label determination module, configured to use a machine-learning-based word segmentation model to determine the word-formation label of each character according to the vector representation and the feature representation, where the word-formation labels include: the current character is the first character of a word, the current character is the last character of a word, the current character is in the middle of a word, and the current character forms a word by itself;
  • a word segmentation module, configured to segment the target sentence according to the word-formation label of each character.
  • The present application provides a word segmentation device based on a multi-level dictionary, including:
  • a memory, configured to store a computer program;
  • a processor, configured to execute the computer program to implement the multi-level dictionary-based word segmentation method described above.
  • The present application provides a readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the multi-level dictionary-based word segmentation method described above.
  • A word segmentation method based on a multi-level dictionary includes: for a target sentence, generating a vector representation of each character, and generating a feature representation of each character in at least two dictionaries; using a machine-learning-based word segmentation model to determine the word-formation label of each character according to the vector representation and the feature representation; and segmenting the target sentence according to the word-formation label of each character.
  • The method uses at least two dictionaries to assist the word segmentation model.
  • When representing a character, not only is a conventional vector representation generated, but also a feature representation of the character in at least two dictionaries.
  • The word-formation label of that character is then determined from the vector representation and the feature representation.
  • In essence, this method improves the overall segmentation performance by distinguishing the status and importance of different words. For example, when the at least two dictionaries are divided by domain, the method lets the segmentation model learn which domain a word belongs to.
  • This effectively improves the domain adaptability of the model; when the at least two dictionaries are divided by word-formation probability, the method lets the model learn word-formation probability information, thereby significantly improving segmentation accuracy. The dictionaries can even be divided by domain and word-formation probability at the same time, improving domain adaptability and segmentation accuracy simultaneously.
  • The present application also provides a word segmentation apparatus and device based on a multi-level dictionary, and a readable storage medium, whose technical effects correspond to those of the above method and are not repeated here.
  • FIG. 1 is an implementation flowchart of Embodiment 1 of a multi-level dictionary-based word segmentation method provided by this application;
  • FIG. 2 is a network structure diagram of Embodiment 2 of a multi-level dictionary-based word segmentation method provided by this application;
  • FIG. 3 is a network structure diagram of Embodiment 3 of a multi-level dictionary-based word segmentation method provided by this application;
  • FIG. 4 is a functional block diagram of an embodiment of a multi-level dictionary-based word segmentation apparatus provided by this application.
  • Each character requires not only its own character vector but also a dictionary feature vector constructed from the dictionary and the context.
  • To address this, the present application provides a word segmentation method, apparatus and device based on a multi-level dictionary, and a readable storage medium, using at least two dictionaries to assist the segmentation model: when representing a character, a conventional vector representation is generated together with a feature representation of the character in at least two dictionaries, and the word-formation label of the character is finally determined from both. By distinguishing the status and importance of different words, the segmentation performance of the overall solution is improved.
  • Referring to FIG. 1, Embodiment 1 includes the following steps.
  • The process of generating the feature representation of each character in at least two dictionaries specifically includes: for each character, generating its feature representation in each dictionary, and concatenating the feature representations across the dictionaries to obtain its feature representation in the at least two dictionaries.
  • The at least two dictionaries may be dictionaries divided by domain, dictionaries divided by word-formation probability, or even dictionaries divided by domain and word-formation probability at the same time.
  • For key domains, word-formation probability can be divided at a finer granularity to further improve the segmentation performance of the model.
  • For example, dictionary 1 may describe words whose word-formation probability in domain A is 80% to 100%,
  • dictionary 2 may describe words whose word-formation probability in domain A is 60% to 80%,
  • and dictionary 3 may describe words whose word-formation probability in domain B is 60% to 100%.
  • The word segmentation task can be regarded as a sequence labeling task: each character is labeled according to its position within a word, thereby achieving segmentation.
  • In practice there may be different labeling schemes and, correspondingly, word-formation labels classified in different ways; this embodiment does not limit which word-formation labels are used.
  • A commonly used labeling scheme is provided here, namely the 4-tag labeling scheme.
  • Under this scheme, word-formation labels fall into the following four types: the current character is the first character of a word, the current character is the last character of a word, the current character is in the middle of a word, and the current character forms a word by itself.
  • Word segmentation models based on machine learning include, but are not limited to, models based on traditional discrete features and models based on neural networks.
  • When a model based on traditional discrete features is used, the process of generating the vector representation of each character in S101 specifically includes: using feature engineering to generate the vector representation of each character according to preset feature templates.
  • The feature templates are used to mine features internal to named entities and association features between contexts.
  • As a specific implementation, the feature templates may be set to: character n-gram features, character reduplication features, and character category features.
  • The feature templates can be set and adjusted according to actual requirements; this embodiment does not limit which features are selected.
  • When a neural-network-based model is used, the process of generating the vector representation of each character in S101 specifically includes: randomly generating an embedding vector table, and, for the target sentence, determining the vector representation of each character by querying the embedding vector table.
  • When a model based on traditional discrete features is used, the process in S101 of generating the feature representation of each character in at least two dictionaries specifically includes: generating unigram and trigram features of each character in the at least two dictionaries as the feature representations.
  • When a neural-network-based model is used, this process specifically includes: generating 2-gram, 3-gram, 4-gram and 5-gram features of each character in the at least two dictionaries as the feature representations.
  • S103: Segment the target sentence according to the word-formation label of each character.
  • Segmenting the target sentence according to the word-formation labels is the process of converting the sentence into a word sequence; this embodiment does not elaborate on it further.
  • When a neural-network-based model is used, the process in S102 of determining the word-formation label of each character according to the vector representation and the feature representation specifically includes: using the neural-network-based model to perform feature extraction on the vector representation and the feature representation, obtaining a feature vector for each character, and determining the word-formation label of each character according to the feature vector.
  • For this feature extraction process, this embodiment provides the following two implementations:
  • the first implementation: performing feature extraction on the vector representation and the feature representation separately to obtain a first feature vector and a second feature vector, and concatenating the first and second feature vectors to obtain the feature vector of each character;
  • the second implementation: concatenating the vector representation and the feature representation, and performing feature extraction on the concatenation result to obtain the feature vector of each character.
  • In theory both implementations realize the segmentation scheme of this embodiment, but in actual tests the first implementation showed better segmentation performance than the second, so this embodiment takes the first implementation as the more preferred one.
  • This embodiment provides a word segmentation method based on a multi-level dictionary, which uses at least two dictionaries to assist the segmentation model.
  • When representing a character, it generates not only a conventional vector representation but also a feature representation of the character in at least two dictionaries, and finally determines the word-formation label of the character from both.
  • As mentioned above, the word segmentation task can be viewed as a sequence labeling task.
  • Taking the 4-tag labeling scheme as an example, B, M, E and W are used to label each character:
  • B indicates that the character is the first character of a word,
  • E indicates that the character is the last character of a word,
  • M indicates that the character is in the middle of a word,
  • and W indicates that the character forms a word by itself.
  • The segmentation process is then the process of finding the optimal label y* for each character in S such that the formula below is satisfied.
  • Next, the dictionary division process is described, taking a two-level dictionary divided by word-formation probability as an example; dictionaries of three or more levels, and divisions by other criteria, can easily be obtained by extension and are not illustrated here.
  • Words are divided into multiple levels according to their word-formation probability.
  • For example, the dictionary can be divided into two levels: the first level for words with 100% word-formation probability, and the second level for words whose word-formation probability is below 100%.
  • A corpus C can be used to divide a dictionary D into two levels by word-formation probability, a first-level dictionary D1 and a second-level dictionary D2, as follows:
  • second-level dictionary D2: words that appear in corpus C but not always as a word, i.e., whose word-formation probability is not 100%, are placed in D2;
  • first-level dictionary D1: words that appear in corpus C and always appear as a word, i.e., whose word-formation probability is 100%, are placed in D1.
  • Embodiment 2 of the multi-level dictionary-based word segmentation method provided by this application is introduced in detail below.
  • Embodiment 2 is based on the CRF word segmentation model.
  • On top of the traditional CRF model, multi-level dictionary features are added according to the aforementioned two-level dictionary to improve segmentation performance.
  • As shown in Table 1, the CRF model uses basic features common in Chinese word segmentation: character n-gram features, character reduplication features, and character category features. In addition to these three kinds of features, the CRF model also incorporates dictionary-related features.
  • f(S, i, y_{i-1}, y_i) is the feature vector returned by the feature extraction function, and w is the corresponding weight vector.
  • f(S, i, y_{i-1}, y_i) extracts the aforementioned basic features according to the feature templates in Table 1 below:
  • the subscript i denotes the position relative to the current character,
  • and c_i denotes the character whose position relative to the current character is i.
  • For example, c_0 denotes the current character,
  • and c_{-1} denotes the character immediately preceding it.
  • Dup() indicates whether two characters are identical, returning 1 if they are and 0 otherwise.
  • Type(c_i) indicates the type of a character; here, type refers to categories such as punctuation marks, Arabic numerals, Chinese numerals and letters, rather than {B, M, E, W} of the 4-tag scheme.
  • Since the single-level dictionary is divided into a two-level dictionary, the dictionary feature templates need to be modified accordingly to reflect the word-formation probability information of words.
  • The feature templates of a single-level dictionary are shown in Table 2.
  • During training, the goal is to maximize the likelihood.
  • Embodiment 3 of the multi-level dictionary-based word segmentation method provided by this application is introduced in detail below.
  • Embodiment 3 is based on the BiLSTM-CRF word segmentation model and, according to the aforementioned two-level dictionary, adds multi-level dictionary features to the model to improve segmentation performance.
  • Each character c requires not only its own character vector e, but also a dictionary feature vector t constructed from the two-level dictionary and the context.
  • The feature vector indicates whether the strings composed of c and its neighboring characters constitute words.
  • The feature templates cover fields from 2-gram to 5-gram, each length covering two cases (the character is the head or the tail of the field), and the final feature vector is a 16-dimensional 0-1 vector.
  • FIG. 3 shows a variant of the BiLSTM-CRF model, divided into three layers: a representation layer, a feature layer and a prediction layer, each of which is described below.
  • A character embedding table is randomly initialized, with each vector representing one character; the embedding of each character can easily be obtained through the index table.
  • Each character c_i also builds a dictionary feature vector from the dictionary and the context.
  • First, n-gram strings are constructed for character c_i based on the feature templates.
  • The feature templates are shown in Table 5.
  • The feature value is 1 if the corresponding string is in the dictionary, and 0 otherwise.
  • For each template string, a binary value is produced indicating whether the string appears in the dictionary.
  • The output value of the k-th template for c_i with respect to dictionary D is denoted t_i^{D,k}.
  • In total, c_i generates an 8-dimensional 0-1 vector from dictionary D. Since the dictionary is divided into two levels, the feature vector of c_i for the first-level dictionary D_1 is t_i^{D_1} and that for the second-level dictionary D_2 is t_i^{D_2}; the final dictionary feature vector is obtained by concatenating the dictionary feature vectors of all levels.
  • An LSTM consists of four parts: an input gate, a forget gate, an output gate and a memory cell; it can retain useful contextual information while handling long-distance dependencies.
  • For an input vector sequence {x_1, x_2, x_3, ..., x_n}, since the LSTM can effectively encode the entire sequence, the global information of x_i within the sequence is obtained by encoding the whole sequence with the LSTM.
  • σ denotes the sigmoid activation function,
  • and W and b denote the weights and biases of the corresponding gates, respectively.
  • A unidirectional LSTM can only encode sentence information in one direction: for the i-th character of the sentence, the forward hidden state contains only the information of the first i characters and none of the sequence after the i-th character. To let every character carry both preceding and following information, two LSTMs in opposite directions are used to encode the sentence. Finally, the hidden outputs of the forward and backward LSTMs are concatenated to obtain the bidirectional representation of character c_i over the whole sentence sequence.
  • In the representation layer, each character c_i corresponds to two vectors: e_i and t_i.
  • e_i denotes the character embedding of c_i,
  • and t_i denotes the dictionary feature vector generated for c_i from the dictionary and the context, formed by concatenating the vectors obtained from the first-level and second-level dictionaries. Each of these vectors is fed into a Bi-LSTM.
  • The prediction layer uses a CRF for final label sequence optimization.
  • The final training objective is to maximize the likelihood function LL(T; θ).
  • Based on Embodiment 3, suppose the target sentence is "棒曲霉素的生产量" (the production amount of patulin). For a single-dictionary segmentation model whose dictionary is {棒曲霉素, 生产, 产量} ({patulin, production, output}), the model input and output are shown in Table 6.
  • The word segmentation apparatus based on a multi-level dictionary provided by the embodiments of this application is described below.
  • The apparatus described below and the method described above may be referred to in correspondence with each other.
  • The multi-level dictionary-based word segmentation apparatus of this embodiment includes:
  • a representation module 401, configured to generate, for a target sentence, a vector representation of each character, and to generate a feature representation of each character in at least two dictionaries;
  • a label determination module 402, configured to use a machine-learning-based word segmentation model to determine the word-formation label of each character according to the vector representation and the feature representation, where the word-formation labels include: the current character is the first character of a word, the current character is the last character of a word, the current character is in the middle of a word, and the current character forms a word by itself;
  • a word segmentation module 403, configured to segment the target sentence according to the word-formation label of each character.
  • The apparatus of this embodiment is used to implement the aforementioned multi-level dictionary-based word segmentation method, so its specific implementation can be found in the embodiments of that method above.
  • The representation module 401, the label determination module 402 and the word segmentation module 403 are respectively used to implement steps S101, S102 and S103 of the above method; for their specific implementations, refer to the descriptions of the corresponding embodiments, which are not repeated here.
  • Since the apparatus of this embodiment implements the aforementioned method, its function corresponds to that of the method and is not repeated here.
  • The present application also provides a word segmentation device based on a multi-level dictionary, including:
  • a memory, configured to store a computer program;
  • a processor, configured to execute the computer program to implement the multi-level dictionary-based word segmentation method described above.
  • The present application provides a readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the multi-level dictionary-based word segmentation method described above.
  • A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

A word segmentation method, apparatus and device based on a multi-level dictionary, and a readable storage medium, used to solve the problem that current word segmentation models all use a single-level dictionary, resulting in poor segmentation performance. The multi-level dictionary-based word segmentation method includes: for a target sentence, generating a vector representation of each character, and generating a feature representation of each character in at least two dictionaries (S101); using a machine-learning-based word segmentation model to determine the word-formation label of each character according to the vector representations and feature representations (S102); and segmenting the target sentence according to the word-formation label of each character (S103).

Description

Multi-level dictionary-based word segmentation method, apparatus and device, and readable storage medium
This application claims priority to Chinese Patent Application No. 202011079701.0, entitled "Multi-level dictionary-based word segmentation method, apparatus and device, and readable storage medium" and filed with the China National Intellectual Property Administration on October 10, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technology, and in particular to a word segmentation method, apparatus and device based on a multi-level dictionary, and a readable storage medium.
Background Art
Chinese word segmentation is the process of splitting an input sentence into a sequence of words. An additional dictionary is usually provided to the model to alleviate the shortage of manually annotated training data. However, current segmentation schemes all use a single-level dictionary, ignoring the fact that different words in the dictionary have different word-formation probabilities, and ignoring that the same string may form a word in one domain but not in another, which leads to poor segmentation performance.
Segmentation methods based on a single-level dictionary also suffer from having little effect on actual segmentation quality. The main reason is that dictionary knowledge is added to the segmentation model as a soft constraint in the form of features, and the word-formation probabilities of dictionary entries vary widely, so the influence on the model is limited.
It can be seen that current segmentation models all use a single-level dictionary, resulting in poor segmentation. How to solve this problem and improve segmentation performance is a problem urgently awaiting solution by those skilled in the art.
Summary of the Invention
The purpose of this application is to provide a word segmentation method, apparatus and device based on a multi-level dictionary, and a readable storage medium, so as to solve the problem that current word segmentation models all use a single-level dictionary, resulting in poor segmentation performance. The specific solution is as follows:
In a first aspect, this application provides a word segmentation method based on a multi-level dictionary, including:
for a target sentence, generating a vector representation of each character, and generating a feature representation of each character in at least two dictionaries;
using a machine-learning-based word segmentation model to determine the word-formation label of each character according to the vector representation and the feature representation, where the word-formation labels include: the current character is the first character of a word, the current character is the last character of a word, the current character is in the middle of a word, and the current character forms a word by itself;
segmenting the target sentence according to the word-formation label of each character.
Preferably, before generating a vector representation of each character for the target sentence and generating a feature representation of each character in at least two dictionaries, the method further includes:
dividing a target dictionary into at least two dictionaries according to domain and/or word-formation probability.
Preferably, generating a vector representation of each character for the target sentence includes:
for the target sentence, generating character n-gram features, character reduplication features and character category features of each character as that character's vector representation;
correspondingly, determining the word-formation label of each character according to the vector representation and the feature representation includes:
using a word segmentation model based on traditional discrete features to determine the word-formation label of each character according to the vector representation and the feature representation.
Preferably, when the word segmentation model based on traditional discrete features is a CRF model, generating a feature representation of each character in at least two dictionaries includes:
generating unigram and trigram features of each character in at least two dictionaries as the feature representations.
Preferably, generating a vector representation of each character for the target sentence includes:
randomly generating an embedding vector table, and, for the target sentence, determining the vector representation of each character by querying the embedding vector table;
correspondingly, determining the word-formation label of each character according to the vector representation and the feature representation includes:
using a neural-network-based word segmentation model to perform feature extraction on the vector representation and the feature representation to obtain a feature vector of each character, and determining the word-formation label of each character according to the feature vector.
Preferably, when the neural-network-based word segmentation model is a BiLSTM-CRF model, generating a feature representation of each character in at least two dictionaries includes:
generating 2-gram, 3-gram, 4-gram and 5-gram features of each character in at least two dictionaries as the feature representations.
Preferably, using the neural-network-based word segmentation model to perform feature extraction on the vector representation and the feature representation to obtain a feature vector of each character includes:
performing feature extraction on the vector representation and the feature representation separately to obtain a first feature vector and a second feature vector, and concatenating the first feature vector and the second feature vector to obtain the feature vector of each character;
or,
concatenating the vector representation and the feature representation, and performing feature extraction on the concatenation result to obtain the feature vector of each character.
In a second aspect, this application provides a word segmentation apparatus based on a multi-level dictionary, including:
a representation module, configured to generate, for a target sentence, a vector representation of each character, and to generate a feature representation of each character in at least two dictionaries;
a label determination module, configured to use a machine-learning-based word segmentation model to determine the word-formation label of each character according to the vector representation and the feature representation, where the word-formation labels include: the current character is the first character of a word, the current character is the last character of a word, the current character is in the middle of a word, and the current character forms a word by itself;
a word segmentation module, configured to segment the target sentence according to the word-formation label of each character.
In a third aspect, this application provides a word segmentation device based on a multi-level dictionary, including:
a memory, configured to store a computer program;
a processor, configured to execute the computer program to implement the multi-level dictionary-based word segmentation method described above.
In a fourth aspect, this application provides a readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the multi-level dictionary-based word segmentation method described above.
The word segmentation method based on a multi-level dictionary provided by this application includes: for a target sentence, generating a vector representation of each character, and generating a feature representation of each character in at least two dictionaries; using a machine-learning-based word segmentation model to determine the word-formation label of each character according to the vector representations and feature representations; and segmenting the target sentence according to the word-formation label of each character.
It can be seen that the method uses at least two dictionaries to assist the segmentation model: when representing a character, not only is the conventional vector representation generated, but also the feature representation of that character in at least two dictionaries, and the word-formation label of the character is finally determined from both. In essence, the method improves the segmentation performance of the overall solution by distinguishing the status and importance of different words. For example, when the at least two dictionaries are divided by domain, the method lets the model learn which domain a word belongs to, effectively improving the model's domain adaptability; when they are divided by word-formation probability, the method lets the model learn word-formation probability information, significantly improving segmentation accuracy; the dictionaries can even be divided by domain and word-formation probability at the same time, improving domain adaptability and segmentation accuracy simultaneously.
In addition, this application also provides a word segmentation apparatus and device based on a multi-level dictionary, and a readable storage medium, whose technical effects correspond to those of the above method and are not repeated here.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
FIG. 1 is an implementation flowchart of Embodiment 1 of a multi-level dictionary-based word segmentation method provided by this application;
FIG. 2 is a network structure diagram of Embodiment 2 of a multi-level dictionary-based word segmentation method provided by this application;
FIG. 3 is a network structure diagram of Embodiment 3 of a multi-level dictionary-based word segmentation method provided by this application;
FIG. 4 is a functional block diagram of an embodiment of a multi-level dictionary-based word segmentation apparatus provided by this application.
Detailed Description of the Embodiments
To enable those skilled in the art to better understand the solutions of this application, this application is described in further detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only part of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.
Currently popular segmentation methods are mostly statistics-based, with considerably better performance than the earlier dictionary-based methods. However, when the domain of the training corpus differs from the target domain, the performance of statistics-based methods drops sharply. To address the incorrect segmentation of rare words and domain-specific words, dictionary information is generally incorporated into the statistical model; there are currently two kinds of solutions:
(1) In traditional machine learning, taking the CRF segmentation model as an example, for each character the model uses not only the basic features common in Chinese word segmentation, but also dictionary-related features.
(2) In neural networks, taking the BiLSTM-CRF model as an example, each character requires not only its own character vector, but also a dictionary feature vector constructed from the dictionary and the context.
However, the above segmentation models all use a single-level dictionary; since a single-level dictionary cannot reflect the differences between words, the overall segmentation performance of the model is severely limited.
To address this problem, this application provides a word segmentation method, apparatus and device based on a multi-level dictionary, and a readable storage medium, using at least two dictionaries to assist the segmentation model: when representing a character, a conventional vector representation is generated together with a feature representation of the character in at least two dictionaries, and the word-formation label of the character is finally determined from both. By distinguishing the status and importance of different words, the segmentation performance of the overall solution is improved.
Embodiment 1 of the multi-level dictionary-based word segmentation method provided by this application is introduced below; referring to FIG. 1, Embodiment 1 includes:
S101: For a target sentence, generate a vector representation of each character, and generate a feature representation of each character in at least two dictionaries.
Specifically, generating the feature representation of each character in at least two dictionaries includes: for each character, generating its feature representation in each dictionary, and concatenating the feature representations across the dictionaries to obtain the feature representation of that character in the at least two dictionaries.
The at least two dictionaries may be dictionaries divided by domain, dictionaries divided by word-formation probability, or even dictionaries divided by domain and word-formation probability at the same time. As a preferred implementation, when dividing dictionaries by both domain and word-formation probability, the word-formation probability can be divided at a finer granularity for key domains, to further improve the segmentation performance of the model. For example, when domain A is more important than domain B, the dictionaries can be divided as follows: dictionary 1 describes words whose word-formation probability in domain A is 80% to 100%, dictionary 2 describes words whose word-formation probability in domain A is 60% to 80%, and dictionary 3 describes words whose word-formation probability in domain B is 60% to 100%.
S102: Using a machine-learning-based word segmentation model, determine the word-formation label of each character according to the vector representations and feature representations.
It can be understood that the segmentation task can be viewed as a sequence labeling task: each character is labeled according to its position within a word, thereby achieving segmentation. In practice there may be different labeling schemes and, correspondingly, word-formation labels classified in different ways; this embodiment does not limit which word-formation labels are used.
A commonly used scheme is provided here, namely the 4-tag labeling scheme, in which the word-formation labels fall into four types: the current character is the first character of a word, the current character is the last character of a word, the current character is in the middle of a word, and the current character forms a word by itself.
The machine-learning-based segmentation models include, but are not limited to, segmentation models based on traditional discrete features and segmentation models based on neural networks.
When a model based on traditional discrete features is selected, the process of generating the vector representation of each character in S101 specifically includes: using feature engineering to generate the vector representation of each character according to preset feature templates. The feature templates are used to mine features internal to named entities and association features between contexts; as a specific implementation, the templates may be set to character n-gram features, character reduplication features and character category features. In practice, the feature templates can be set and adjusted according to actual requirements; this embodiment does not limit which features are selected.
When a neural-network-based model is selected, the process of generating the vector representation of each character in S101 specifically includes: randomly generating an embedding vector table, and, for the target sentence, determining the vector representation of each character by querying the embedding vector table.
The above describes how the conventional character vector representations are generated under different segmentation models. It can be understood that with different models, not only may the conventional character vector representations differ, but the feature representations of characters in the at least two dictionaries may also differ. Specifically, feature templates can be preset for the chosen model and then used to extract the characters' feature representations from the at least two dictionaries. A feasible implementation for each model type is given below; this embodiment is not limited to these:
When a model based on traditional discrete features is selected, the process in S101 of generating the feature representation of each character in at least two dictionaries specifically includes: generating unigram and trigram features of each character in the at least two dictionaries as the feature representations.
When a neural-network-based model is selected, this process specifically includes: generating 2-gram, 3-gram, 4-gram and 5-gram features of each character in the at least two dictionaries as the feature representations.
S103: Segment the target sentence according to the word-formation label of each character.
Segmenting the target sentence according to the word-formation labels is the process of converting the sentence into a word sequence; this embodiment does not elaborate on this part further.
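To make this conversion concrete, the following minimal Python sketch (illustrative only; the function and variable names are not from the application) turns a 4-tag label sequence back into a word sequence:

```python
def labels_to_words(chars, labels):
    """Convert a 4-tag (B/M/E/W) label sequence back into a word list."""
    words, buf = [], ""
    for ch, tag in zip(chars, labels):
        if tag == "W":            # the character forms a word by itself
            if buf:               # flush an unfinished word defensively
                words.append(buf)
                buf = ""
            words.append(ch)
        elif tag == "B":          # first character of a word
            if buf:
                words.append(buf)
            buf = ch
        elif tag == "M":          # word-internal character
            buf += ch
        else:                     # "E": last character of a word
            words.append(buf + ch)
            buf = ""
    if buf:
        words.append(buf)
    return words

# labels_to_words(list("他来到北京"), ["W", "W", "W", "B", "E"])
# -> ["他", "来", "到", "北京"]
```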
It is worth mentioning that when a neural-network-based model is selected, the process in S102 of determining the word-formation label of each character according to the vector representation and the feature representation specifically includes: using the neural-network-based model to perform feature extraction on the vector representation and the feature representation, obtaining a feature vector for each character, and determining the word-formation label of each character from the feature vector.
For this feature extraction process, this embodiment provides the following two implementations:
the first implementation: performing feature extraction on the vector representation and the feature representation separately to obtain a first feature vector and a second feature vector, and concatenating the first and second feature vectors to obtain the feature vector of each character;
the second implementation: concatenating the vector representation and the feature representation, and performing feature extraction on the concatenation result to obtain the feature vector of each character.
In theory, both implementations realize the segmentation scheme of this embodiment, but in actual tests the first implementation showed better segmentation performance than the second; this embodiment therefore takes the first implementation as the more preferred one.
This embodiment provides a word segmentation method based on a multi-level dictionary that uses at least two dictionaries to assist the segmentation model: when representing a character, not only is the conventional vector representation generated, but also the character's feature representation in at least two dictionaries, and the word-formation label is finally determined from both. By distinguishing the status and importance of different words, the segmentation performance of the overall solution is improved.
As mentioned above, the segmentation task can be viewed as a sequence labeling task. Taking the 4-tag scheme as an example, B, M, E and W are used to label each character: B indicates that the character is the first character of a word, E that it is the last character of a word, M that it is in the middle of a word, and W that it forms a word by itself.
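For preparing training data, the inverse conversion is equally simple. A minimal Python sketch (names are illustrative, not from the application) that produces 4-tag labels from a gold segmentation:

```python
def words_to_labels(words):
    """Produce 4-tag (B/M/E/W) labels from a gold word segmentation."""
    labels = []
    for w in words:
        if len(w) == 1:
            labels.append("W")    # single-character word
        else:                     # B, then zero or more M's, then E
            labels.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return labels

# words_to_labels(["北京", "大学"]) -> ["B", "E", "B", "E"]
```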
Suppose the sentence sequence is S = {c_1, c_2, c_3, ..., c_n} and the label sequence is Y = (y_1, y_2, ..., y_n), where c_i denotes the i-th character of the target sentence, y_i the label of the i-th character, i ∈ [1, n], n the total number of characters, and y_i ∈ {B, M, E, W}. The segmentation process is then the process of finding, for every character in S, the optimal label y* satisfying:

$$Y^* = \arg\max_{Y} P(Y \mid S)$$

The above gives an intuitive description of the segmentation task. Below, the implementation of this application is described in detail, based first on the CRF segmentation model and then on the BiLSTM-CRF-based segmentation model.
First, the dictionary division process is described, taking the two-level dictionary divided by word-formation probability as an example; dictionaries of three or more levels, and division by other criteria, can easily be obtained by extension and are not illustrated here.
Words are divided into multiple levels according to their word-formation probability. For example, in a two-level dictionary, the first level holds words with a 100% word-formation probability and the second level holds words whose word-formation probability is below 100%. Suppose a single-level dictionary D and some corpus C are available. Corpus C can be used to divide dictionary D into two levels by word-formation probability, a first-level dictionary D_1 and a second-level dictionary D_2, as follows:
Second-level dictionary D_2: words that appear in corpus C but not always as a word, i.e., whose word-formation probability is not 100%, are placed in D_2.
First-level dictionary D_1: words that appear in corpus C and always appear as a word, i.e., whose word-formation probability is 100%, are placed in D_1.
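The division itself is straightforward to implement. Below is a minimal Python sketch of this split under stated assumptions: a gold-segmented corpus is available, and raw-string occurrences are counted with str.count, which counts non-overlapping matches (a simplification):

```python
from collections import Counter

def split_dictionary(dictionary, corpus_words, corpus_text):
    """Split a single-level dictionary D into D1 (always a word in C)
    and D2 (sometimes a word in C) by word-formation probability."""
    as_word = Counter(corpus_words)          # occurrences as a segmented word
    d1, d2 = set(), set()
    for w in dictionary:
        occurrences = corpus_text.count(w)   # occurrences as a raw string
        if occurrences == 0:
            continue                         # never appears in C: leave out
        if as_word[w] == occurrences:
            d1.add(w)                        # word-formation probability 100%
        else:
            d2.add(w)                        # word-formation probability < 100%
    return d1, d2
```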
The following introduces in detail Embodiment 2 of the multi-level dictionary-based word segmentation method provided by this application. Embodiment 2 is based on the CRF segmentation model and, according to the aforementioned two-level dictionary, adds multi-level dictionary features on top of the traditional CRF model to improve segmentation performance.
In this embodiment, as shown in Table 1, the CRF model uses the basic features common in Chinese word segmentation: character n-gram features, character reduplication features and character category features. In addition to these three kinds of features, the CRF model also incorporates dictionary-related features.
For a given input sequence S = {c_1, c_2, c_3, ..., c_n}, the model predicts an output sequence Y = (y_1, y_2, ..., y_n); the goal of the CRF model is to find Y such that Y = arg max P(Y|S). The definition of P(Y|S) in the CRF is:

$$P(Y \mid S) = \frac{\exp(\mathrm{Score}(S, Y))}{Z(S)}$$

Here Z(S) is a normalization factor:

$$Z(S) = \sum_{Y'} \exp(\mathrm{Score}(S, Y'))$$

where Y' ranges over all possible output sequences of the same length as S. In addition:

$$\mathrm{Score}(S, Y) = \sum_{i=1}^{n} w \cdot f(S, i, y_{i-1}, y_i)$$

where f(S, i, y_{i-1}, y_i) is the feature vector returned by the feature extraction function and w is the corresponding weight vector. Here f(S, i, y_{i-1}, y_i) extracts the aforementioned basic features according to the feature templates in Table 1 below:
Table 1
[Table 1: basic feature templates (rendered as an image in the original)]
The subscript i denotes the position relative to the current character, and c_i denotes the character at relative position i. For example, c_0 denotes the current character and c_{-1} the character immediately preceding it. Dup() indicates whether two characters are identical, returning 1 if they are and 0 otherwise. Type(c_i) denotes the type of a character, where type refers to categories such as punctuation marks, Arabic numerals, Chinese numerals and letters, rather than {B, M, E, W} of the 4-tag scheme.
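Since Table 1 survives only as an image, the sketch below gives one plausible Python instantiation of such basic templates; the exact template set and the category inventory of Type() are assumptions, not the application's own:

```python
def char_type(ch):
    """Coarse character categories for the Type() feature (illustrative)."""
    if ch in "，。、！？；：“”（）":
        return "PUNCT"
    if ch.isascii() and ch.isdigit():
        return "DIGIT"
    if ch in "零一二三四五六七八九十百千万亿":
        return "CN_NUM"
    if ch.isascii() and ch.isalpha():
        return "LATIN"
    return "OTHER"

def basic_features(sent, j):
    """Basic discrete features for the character at position j of sent,
    in the spirit of the Table 1 templates."""
    pad = lambda k: sent[k] if 0 <= k < len(sent) else "<PAD>"
    feats = []
    for i in range(-2, 3):          # character unigrams c-2 .. c2
        feats.append(f"c[{i}]={pad(j + i)}")
    for i in range(-2, 2):          # character bigrams c-2c-1 .. c1c2
        feats.append(f"c[{i}]c[{i + 1}]={pad(j + i)}{pad(j + i + 1)}")
    feats.append(f"dup(c0,c1)={int(pad(j) == pad(j + 1))}")   # reduplication
    feats.append(f"type(c0)={char_type(pad(j))}")             # character category
    return feats
```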
Since this embodiment divides the single-level dictionary into a two-level dictionary, the dictionary feature templates need to be modified accordingly to reflect the word-formation probability information. Taking the extraction of unigram and trigram features from the dictionary as an example, the feature templates of a single-level dictionary are shown in Table 2:
Table 2
[Table 2: single-level dictionary feature templates (rendered as an image in the original)]
After modification, the feature templates of the two-level dictionary are shown in Table 3:
Table 3
[Table 3: two-level dictionary feature templates (rendered as an image in the original)]
Assuming the character under consideration is at position j of sentence S, then:

$$[f_B]_{i,L} = f_B(S, j+i, D_L)$$
$$[f_M]_{i,L} = f_M(S, j+i, D_L)$$
$$[f_E]_{i,L} = f_E(S, j+i, D_L)$$

where D_L denotes the L-th level dictionary, with L = 1 or 2.
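The application leaves the exact form of f_B, f_M and f_E to Tables 2 and 3, which are images in the original; the following Python sketch therefore implements one plausible reading, namely binary indicators of whether some dictionary entry begins at, ends at, or strictly covers a given position, emitted per offset and per dictionary level:

```python
def dict_flags(sent, k, level_dict, max_len=5):
    """f_B/f_M/f_E as 0-1 flags: does some entry of level_dict begin at,
    cover the middle of, or end at position k of sent?"""
    f_b = f_m = f_e = 0
    n = len(sent)
    for length in range(2, max_len + 1):
        for start in range(max(0, k - length + 1), min(k + 1, n - length + 1)):
            if sent[start:start + length] in level_dict:
                if start == k:
                    f_b = 1                      # entry begins at k
                elif start + length - 1 == k:
                    f_e = 1                      # entry ends at k
                else:
                    f_m = 1                      # k strictly inside the entry
    return f_b, f_m, f_e

def two_level_dict_features(sent, j, d1, d2, window=1):
    """Emit [f_B]_{i,L}, [f_M]_{i,L}, [f_E]_{i,L} for offsets i in
    [-window, window] and levels L in {1, 2}, mirroring Table 3."""
    feats = []
    for level, d in ((1, d1), (2, d2)):
        for i in range(-window, window + 1):
            k = j + i
            f_b, f_m, f_e = dict_flags(sent, k, d) if 0 <= k < len(sent) else (0, 0, 0)
            feats += [f"fB[{i},{level}]={f_b}",
                      f"fM[{i},{level}]={f_m}",
                      f"fE[{i},{level}]={f_e}"]
    return feats
```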
During training, the goal is to maximize the likelihood; in practice, one can take the negative of the likelihood function and minimize it with stochastic gradient descent. The forward-backward algorithm can be used when computing the conditional probabilities, and the Viterbi algorithm is used for final decoding.
Based on Embodiment 2, suppose the target sentence is "棒曲霉素的生产量" (the production amount of patulin), with "的" as the current character; "棒曲霉素" (patulin) is placed in the first-level dictionary, while {生产, 产量} ({production, output}) are placed in the second-level dictionary. Extracting features according to Table 3, the resulting feature representation of this character in the two-level dictionary is shown in Table 4:
Table 4
[Table 4: two-level dictionary features of the character "的" (rendered as an image in the original)]
The following introduces in detail Embodiment 3 of the multi-level dictionary-based word segmentation method provided by this application. Embodiment 3 is based on the BiLSTM-CRF segmentation model and, according to the aforementioned two-level dictionary, adds multi-level dictionary features to the model to improve segmentation performance.
With BiLSTM-CRF as the base model, each character c requires not only a character vector e representing itself, but also a dictionary feature vector t constructed from the two-level dictionary and the context. This feature vector indicates whether the strings composed of c and its neighboring characters constitute words; the feature templates cover 2-gram to 5-gram fields, each length covering two cases (the character is the head or the tail of the field), so the final feature vector is a 16-dimensional 0-1 vector.
FIG. 3 shows a variant of the BiLSTM-CRF model. It consists of three layers: a representation layer, a feature layer and a prediction layer, each of which is introduced below.
(1) Representation layer
In the neural network, a character embedding table is randomly initialized, with each vector representing one character; the embedding of each character can easily be obtained through the index table.
For a sentence S = {c_1, c_2, c_3, ..., c_n}, c_i denotes the i-th character of the sentence and n the sentence length. For each character c_i, its embedding e_i is obtained by table lookup.
In addition to the embedding, each character c_i also builds a dictionary feature vector from the dictionary and the context. First, n-gram strings are constructed for c_i based on the feature templates shown in Table 5:
Table 5

Type | Feature meaning
Forward 2-gram | c_{i-1} c_i is in dictionary 1
Forward 3-gram | c_{i-2} c_{i-1} c_i is in dictionary 1
Forward 4-gram | c_{i-3} c_{i-2} c_{i-1} c_i is in dictionary 1
Forward 5-gram | c_{i-4} c_{i-3} c_{i-2} c_{i-1} c_i is in dictionary 1
Backward 2-gram | c_i c_{i+1} is in dictionary 1
Backward 3-gram | c_i c_{i+1} c_{i+2} is in dictionary 1
Backward 4-gram | c_i c_{i+1} c_{i+2} c_{i+3} is in dictionary 1
Backward 5-gram | c_i c_{i+1} c_{i+2} c_{i+3} c_{i+4} is in dictionary 1
Forward 2-gram | c_{i-1} c_i is in dictionary 2
Forward 3-gram | c_{i-2} c_{i-1} c_i is in dictionary 2
Forward 4-gram | c_{i-3} c_{i-2} c_{i-1} c_i is in dictionary 2
Forward 5-gram | c_{i-4} c_{i-3} c_{i-2} c_{i-1} c_i is in dictionary 2
Backward 2-gram | c_i c_{i+1} is in dictionary 2
Backward 3-gram | c_i c_{i+1} c_{i+2} is in dictionary 2
Backward 4-gram | c_i c_{i+1} c_{i+2} c_{i+3} is in dictionary 2
Backward 5-gram | c_i c_{i+1} c_{i+2} c_{i+3} c_{i+4} is in dictionary 2
If the corresponding character or string is in the dictionary, the feature value is 1; otherwise it is 0.
For each string of c_i in the feature templates, a binary value is produced indicating whether that string appears in the dictionary. Let $t_i^{D,k}$ denote the output value of the k-th template for c_i with respect to dictionary D. In total, c_i generates an 8-dimensional 0-1 vector $t_i^D$ from dictionary D. Since the dictionary is divided into two levels, the feature vector of c_i for the first-level dictionary D_1 is $t_i^{D_1}$, and that for the second-level dictionary D_2 is $t_i^{D_2}$. The final dictionary feature vector is obtained by concatenating the dictionary feature vectors of all levels:

$$t_i = [t_i^{D_1}; t_i^{D_2}]$$
(2) Feature layer
An LSTM consists of four parts: an input gate, a forget gate, an output gate and a memory cell. It can retain useful contextual information while handling long-distance dependencies. For an input vector sequence X = {x_1, x_2, x_3, ..., x_n}, since the LSTM can effectively encode the entire sequence, encoding the whole sequence with the LSTM yields the global information of x_i within the sequence:

$$i_i = \sigma(W_i x_i + U_i h_{i-1} + b_i)$$
$$f_i = \sigma(W_f x_i + U_f h_{i-1} + b_f)$$
$$o_i = \sigma(W_o x_i + U_o h_{i-1} + b_o)$$
$$\tilde{c}_i = \tanh(W_c x_i + U_c h_{i-1} + b_c)$$
$$c_i = f_i \odot c_{i-1} + i_i \odot \tilde{c}_i$$
$$h_i = o_i \odot \tanh(c_i)$$
where i_i, f_i, o_i and c_i denote the outputs of the input gate, forget gate, output gate and cell state at the i-th position, respectively, and x_i and h_i denote the corresponding input vector and hidden vector. σ denotes the sigmoid activation function, and W and b denote the weights and biases of the corresponding gates.
For a sentence, a unidirectional LSTM can only encode information in one direction: for the i-th character, the forward hidden state $\overrightarrow{h_i}$ contains only the information of the first i characters, and none of the sequence after the i-th character. To let every character carry both preceding and following information, two LSTMs in opposite directions are used to encode the sentence. Finally, the hidden outputs of the forward and backward LSTMs are concatenated to obtain the bidirectional representation of character c_i over the whole sentence sequence:

$$h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$$
In the representation layer, each character c_i corresponds to two vectors: e_i and t_i, where e_i is the character embedding of c_i and t_i is the dictionary feature vector generated for c_i from the dictionary and the context, formed by concatenating the vectors obtained from the first-level and second-level dictionaries. Each of these vectors is fed into its own Bi-LSTM:

$$h_i^e = \mathrm{BiLSTM}(e_1, \ldots, e_n)_i, \qquad h_i^t = \mathrm{BiLSTM}(t_1, \ldots, t_n)_i$$

Before being input to the CRF layer, the two hidden vectors are concatenated:

$$h_i = [h_i^e; h_i^t]$$
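As a sketch of this feature layer, the following PyTorch module (hyperparameters and names are illustrative assumptions, not the application's) runs the embedding channel and the dictionary-feature channel through separate Bi-LSTMs and concatenates their hidden states as input to the CRF layer:

```python
import torch
import torch.nn as nn

class TwoChannelBiLSTM(nn.Module):
    """Encode character embeddings e and dictionary feature vectors t with
    separate Bi-LSTMs; concatenate the hidden states for the CRF layer."""

    def __init__(self, vocab_size, emb_dim=100, dict_dim=16, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm_e = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.lstm_t = nn.LSTM(dict_dim, hidden, batch_first=True,
                              bidirectional=True)

    def forward(self, char_ids, dict_feats):
        # char_ids:   (batch, seq_len) character indices
        # dict_feats: (batch, seq_len, 16) float 0-1 dictionary features
        h_e, _ = self.lstm_e(self.emb(char_ids))   # (batch, seq, 2*hidden)
        h_t, _ = self.lstm_t(dict_feats)           # (batch, seq, 2*hidden)
        return torch.cat([h_e, h_t], dim=-1)       # fed to the CRF layer
```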
(3) Prediction layer
The prediction layer uses a CRF for final label-sequence optimization.
In fully annotated data, every character of a sentence has a definite segmentation label. Thus, in a fully annotated dataset T containing N sentences, for a sentence S of length n, only one of its $4^n$ possible segmentation paths is the correct path Y. The learning problem of the fully supervised CRF is to maximize, over the weights θ, the likelihood LL(T; θ) of each sentence S being labeled with its gold sequence Y, where Score(S, Y) denotes the score of labeling S with sequence Y, Z(S) denotes the sum of the scores of the $4^n$ possible paths of S, and p(Y|S) denotes the probability that S is labeled Y:

$$p(Y \mid S) = \frac{\exp(\mathrm{Score}(S, Y))}{Z(S)}, \qquad Z(S) = \sum_{Y'} \exp(\mathrm{Score}(S, Y'))$$
$$LL(T; \theta) = \sum_{(S, Y) \in T} \log p(Y \mid S)$$

The final training objective is to maximize the likelihood function LL(T; θ).
Based on Embodiment 3, suppose the target sentence is "棒曲霉素的生产量" (the production amount of patulin). For a single-dictionary segmentation model, the dictionary used by the model is {棒曲霉素, 生产, 产量} ({patulin, production, output}), and the model input and output are shown in Table 6:
Table 6
[Table 6: model input and output under a single-level dictionary (rendered as an image in the original)]
For the same target sentence, for the segmentation model based on the two-level dictionary, "棒曲霉素" is placed in the highest-priority first-level dictionary while {生产, 产量} are placed in the lower-priority second-level dictionary; the resulting output is shown in Table 7:
Table 7
[Table 7: model input and output under the two-level dictionary (rendered as an image in the original)]
The word segmentation apparatus based on a multi-level dictionary provided by the embodiments of this application is introduced below; the apparatus described below and the method described above may be referred to in correspondence with each other.
As shown in FIG. 4, the multi-level dictionary-based word segmentation apparatus of this embodiment includes:
a representation module 401, configured to generate, for a target sentence, a vector representation of each character, and to generate a feature representation of each character in at least two dictionaries;
a label determination module 402, configured to use a machine-learning-based word segmentation model to determine the word-formation label of each character according to the vector representations and feature representations, where the word-formation labels include: the current character is the first character of a word, the current character is the last character of a word, the current character is in the middle of a word, and the current character forms a word by itself;
a word segmentation module 403, configured to segment the target sentence according to the word-formation label of each character.
The apparatus of this embodiment is used to implement the aforementioned multi-level dictionary-based word segmentation method, so its specific implementation can be found in the embodiments of that method above; for example, the representation module 401, the label determination module 402 and the word segmentation module 403 are respectively used to implement steps S101, S102 and S103 of the method. For their specific implementations, refer to the descriptions of the corresponding embodiments, which are not repeated here.
In addition, since the apparatus of this embodiment is used to implement the aforementioned multi-level dictionary-based word segmentation method, its function corresponds to that of the method and is not repeated here.
In addition, this application also provides a word segmentation device based on a multi-level dictionary, including:
a memory, configured to store a computer program;
a processor, configured to execute the computer program to implement the multi-level dictionary-based word segmentation method described above.
Finally, this application provides a readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the multi-level dictionary-based word segmentation method described above.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts among the embodiments may be referred to among one another. For the apparatus disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, the description is relatively brief; see the method section for the relevant parts.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The solutions provided by this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application; the description of the above embodiments is only intended to help understand the method of this application and its core idea. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementation and the scope of application in accordance with the idea of this application. In summary, the content of this specification should not be construed as limiting this application.

Claims (10)

  1. A word segmentation method based on a multi-level dictionary, characterized by comprising:
    for a target sentence, generating a vector representation of each character, and generating a feature representation of each character in at least two dictionaries;
    using a machine-learning-based word segmentation model to determine a word-formation label of each character according to the vector representation and the feature representation, wherein the word-formation labels comprise: the current character is the first character of a word, the current character is the last character of a word, the current character is in the middle of a word, and the current character forms a word by itself;
    segmenting the target sentence according to the word-formation label of each character.
  2. The method according to claim 1, characterized in that before generating a vector representation of each character for the target sentence and generating a feature representation of each character in at least two dictionaries, the method further comprises:
    dividing a target dictionary into at least two dictionaries according to domain and/or word-formation probability.
  3. The method according to claim 1, characterized in that generating a vector representation of each character for the target sentence comprises:
    for the target sentence, generating character n-gram features, character reduplication features and character category features of each character as that character's vector representation;
    correspondingly, determining the word-formation label of each character according to the vector representation and the feature representation comprises:
    using a word segmentation model based on traditional discrete features to determine the word-formation label of each character according to the vector representation and the feature representation.
  4. The method according to claim 3, characterized in that when the word segmentation model based on traditional discrete features is a CRF model, generating a feature representation of each character in at least two dictionaries comprises:
    generating unigram and trigram features of each character in at least two dictionaries as the feature representations.
  5. The method according to claim 1, characterized in that generating a vector representation of each character for the target sentence comprises:
    randomly generating an embedding vector table; for the target sentence, determining the vector representation of each character by querying the embedding vector table;
    correspondingly, determining the word-formation label of each character according to the vector representation and the feature representation comprises:
    using a neural-network-based word segmentation model to perform feature extraction on the vector representation and the feature representation to obtain a feature vector of each character, and determining the word-formation label of each character according to the feature vector.
  6. The method according to claim 5, characterized in that when the neural-network-based word segmentation model is a BiLSTM-CRF model, generating a feature representation of each character in at least two dictionaries comprises:
    generating 2-gram, 3-gram, 4-gram and 5-gram features of each character in at least two dictionaries as the feature representations.
  7. The method according to claim 5, characterized in that using the neural-network-based word segmentation model to perform feature extraction on the vector representation and the feature representation to obtain a feature vector of each character comprises:
    performing feature extraction on the vector representation and the feature representation separately to obtain a first feature vector and a second feature vector; concatenating the first feature vector and the second feature vector to obtain the feature vector of each character;
    or,
    concatenating the vector representation and the feature representation, and performing feature extraction on the concatenation result to obtain the feature vector of each character.
  8. A word segmentation apparatus based on a multi-level dictionary, characterized by comprising:
    a representation module, configured to generate, for a target sentence, a vector representation of each character, and to generate a feature representation of each character in at least two dictionaries;
    a label determination module, configured to use a machine-learning-based word segmentation model to determine a word-formation label of each character according to the vector representation and the feature representation, wherein the word-formation labels comprise: the current character is the first character of a word, the current character is the last character of a word, the current character is in the middle of a word, and the current character forms a word by itself;
    a word segmentation module, configured to segment the target sentence according to the word-formation label of each character.
  9. A word segmentation device based on a multi-level dictionary, characterized by comprising:
    a memory, configured to store a computer program;
    a processor, configured to execute the computer program to implement the multi-level dictionary-based word segmentation method according to any one of claims 1-7.
  10. A readable storage medium, characterized in that a computer program is stored on the readable storage medium, and when executed by a processor, the computer program implements the multi-level dictionary-based word segmentation method according to any one of claims 1-7.
PCT/CN2021/088599 2020-10-10 2021-04-21 Multi-level dictionary-based word segmentation method, apparatus and device, and readable storage medium WO2022073333A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011079701.0A CN112214994B (zh) 2020-10-10 2020-10-10 基于多级词典的分词方法、装置、设备及可读存储介质
CN202011079701.0 2020-10-10

Publications (1)

Publication Number Publication Date
WO2022073333A1 (zh)

Family

ID=74053125

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/088599 WO2022073333A1 (zh) 2020-10-10 2021-04-21 基于多级词典的分词方法、装置、设备及可读存储介质

Country Status (2)

Country Link
CN (1) CN112214994B (zh)
WO (1) WO2022073333A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112214994B (zh) * 2020-10-10 2021-06-01 苏州大学 基于多级词典的分词方法、装置、设备及可读存储介质
CN114065740A (zh) * 2021-09-29 2022-02-18 北京搜狗科技发展有限公司 语句的序列标注方法、装置、电子设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530298A (zh) * 2012-07-06 2014-01-22 深圳市世纪光速信息技术有限公司 一种信息搜索方法和装置
US20180018577A1 (en) * 2016-07-12 2018-01-18 International Business Machines Corporation Generating training data for machine learning
CN108268444A (zh) * 2018-01-10 2018-07-10 南京邮电大学 一种基于双向lstm、cnn和crf的中文分词方法
CN111368541A (zh) * 2018-12-06 2020-07-03 北京搜狗科技发展有限公司 命名实体识别方法及装置
EP3683695A1 (en) * 2017-09-11 2020-07-22 SCREEN Holdings Co., Ltd. Synonym dictionary creation device, synonym dictionary creation program, and synonym dictionary creation method
CN112214994A (zh) * 2020-10-10 2021-01-12 苏州大学 基于多级词典的分词方法、装置、设备及可读存储介质

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5488366A (en) * 1993-10-12 1996-01-30 Industrial Technology Research Institute Segmented variable length decoding apparatus for sequentially decoding single code-word within a fixed number of decoding cycles
US9348809B1 (en) * 2015-02-02 2016-05-24 Linkedin Corporation Modifying a tokenizer based on pseudo data for natural language processing
CN106528536A (zh) * 2016-11-14 2017-03-22 北京赛思信安技术股份有限公司 一种基于词典与文法分析的多语种分词方法
CN106610955A (zh) * 2016-12-13 2017-05-03 成都数联铭品科技有限公司 基于词典的多维度情感分析方法
CN107844475A (zh) * 2017-10-12 2018-03-27 北京知道未来信息技术有限公司 一种基于lstm的分词方法
CN108647199A (zh) * 2018-03-23 2018-10-12 江苏速度信息科技股份有限公司 一种地名新词的发现方法
CN109492105B (zh) * 2018-11-10 2022-11-15 上海五节数据科技有限公司 一种基于多特征集成学习的文本情感分类方法
CN109800298B (zh) * 2019-01-29 2023-06-16 苏州大学 一种基于神经网络的中文分词模型的训练方法
CN110008475A (zh) * 2019-04-10 2019-07-12 出门问问信息科技有限公司 分词处理方法、装置、设备及存储介质
CN111209749A (zh) * 2020-01-02 2020-05-29 湖北大学 一种将深度学习应用于中文分词的方法
CN111666758B (zh) * 2020-04-15 2022-03-22 中国科学院深圳先进技术研究院 中文分词方法、训练设备以及计算机可读存储介质


Also Published As

Publication number Publication date
CN112214994A (zh) 2021-01-12
CN112214994B (zh) 2021-06-01

Similar Documents

Publication Publication Date Title
CN111666427B (zh) 一种实体关系联合抽取方法、装置、设备及介质
CN108829801A (zh) 一种基于文档级别注意力机制的事件触发词抽取方法
CN111666758B (zh) 中文分词方法、训练设备以及计算机可读存储介质
CN109800437A (zh) 一种基于特征融合的命名实体识别方法
Xu et al. Cross-domain and semisupervised named entity recognition in chinese social media: A unified model
CN113190656B (zh) 一种基于多标注框架与融合特征的中文命名实体抽取方法
WO2022073333A1 (zh) 基于多级词典的分词方法、装置、设备及可读存储介质
CN108415906B (zh) 基于领域自动识别篇章机器翻译方法、机器翻译系统
WO2020207179A1 (zh) 一种从视频字幕中提取概念词的方法
Yuan et al. Distant supervision for relation extraction with linear attenuation simulation and non-iid relevance embedding
CN116151132B (zh) 一种编程学习场景的智能代码补全方法、系统及储存介质
WO2023092960A1 (zh) 一种用于法律文书的命名实体识别的标注方法和装置
CN112699685B (zh) 基于标签引导的字词融合的命名实体识别方法
CN111476031A (zh) 一种基于Lattice-LSTM的改进中文命名实体识别方法
CN115906815B (zh) 一种用于修改一种或多种类型错误句子的纠错方法及装置
CN115600597A (zh) 基于注意力机制和词内语义融合的命名实体识别方法、装置、系统及存储介质
CN115438197A (zh) 一种基于双层异质图的事理知识图谱关系补全方法及系统
CN111428518B (zh) 一种低频词翻译方法及装置
CN112036186A (zh) 语料标注方法、装置、计算机存储介质及电子设备
CN116821326A (zh) 基于自注意力和相对位置编码的文本摘要生成方法及装置
Qi et al. Semi-supervised sequence labeling with self-learned features
CN116029300A (zh) 一种强化中文实体语义特征的语言模型训练方法和系统
CN115906854A (zh) 一种基于多级对抗的跨语言命名实体识别模型训练方法
Chang et al. A mixed semantic features model for chinese ner with characters and words
Wang et al. ESN-NER: Entity storage network using attention mechanism for chinese NER

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21876856

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21876856

Country of ref document: EP

Kind code of ref document: A1