CN110750993A - Word segmentation method, word segmentation device, named entity identification method and system - Google Patents

Publication number
CN110750993A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910978522.1A
Other languages
Chinese (zh)
Inventor
张发展
刘世林
罗镇权
李焕
曾途
尹康
杨李伟
吴桐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Business Big Data Technology Co Ltd
Original Assignee
Chengdu Business Big Data Technology Co Ltd

Abstract

The invention relates to a word segmentation method, a word segmentation device, and a named entity identification method and system. The word segmentation method comprises the following steps: constructing a dictionary; generating a prefix tree for the sentence to be segmented based on the dictionary and performing word graph scanning to generate a directed acyclic graph of all possible word formations; using dynamic programming to search for the maximum-probability path and find the maximum-probability segmentation combination based on word frequency; and, for unknown words in the sentence that do not exist in the dictionary, splitting them character by character into individual characters. Because unknown words are split into single characters rather than into words, an unregistered name cannot be recombined with the preceding or following words after segmentation, which improves the recognition accuracy for unregistered names.

Description

Word segmentation method, word segmentation device, named entity identification method and system
Technical Field
The invention relates to the technical field of natural language processing, in particular to a word segmentation method, a word segmentation device and a named entity identification method and system.
Background
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence, and generally includes branches such as sentence classification, information extraction, automatic summarization, and entity identification.
Word segmentation is the basis of natural language processing technology and refers to the process of recombining a continuous character sequence into a word sequence according to a given specification. English is written with spaces between words, so words can be separated directly at the spaces. In Chinese, characters, sentences and paragraphs can generally be delimited by explicit symbols, but there is no formal separator for words. Chinese word segmentation is therefore more difficult than English word segmentation.
Current word segmentation techniques typically include string-matching-based, understanding-based, and statistics-based segmentation methods. A common characteristic of these methods is that they divide a sentence into as many words as possible, so segmentation errors are inevitable, especially for unknown words (i.e. words that do not appear in the dictionary). These errors then propagate to subsequent NLP tasks; for example, in a named entity recognition application they may prevent a named entity from being identified accurately.
Disclosure of Invention
The invention aims to overcome the defect of low word segmentation accuracy in the prior art, and provides a word segmentation method, a word segmentation device, a named entity identification method and a named entity identification system using the word segmentation method so as to improve the accuracy of word segmentation results.
In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:
a method of word segmentation comprising the steps of:
constructing a dictionary;
generating a prefix tree for the sentence to be segmented based on the dictionary, carrying out word graph scanning, and generating a directed acyclic graph formed by all possible word forming conditions;
using dynamic programming to search for the maximum-probability path and find the maximum-probability segmentation combination based on word frequency;
and, for unknown words in the sentence to be segmented that do not exist in the dictionary, splitting them character by character into individual characters.
On the other hand, the embodiment of the invention provides a word segmentation device, which comprises a dictionary construction module, a directed acyclic graph generation module, a segmentation combination module and a character segmentation module, wherein:
the dictionary construction module is used for constructing a dictionary;
the directed acyclic graph generation module is used for generating a prefix tree for the sentence to be segmented based on the dictionary and performing word graph scanning, so as to generate a directed acyclic graph of all possible word formations;
the segmentation combination module is used for searching for the maximum-probability path by dynamic programming and finding the maximum-probability segmentation combination based on word frequency;
the character segmentation module is used for splitting unknown words in the sentence to be segmented that do not exist in the dictionary character by character into individual characters.
On the other hand, the embodiment of the invention provides a named entity identification method, which comprises the following steps:
according to the word segmentation method, performing word segmentation on the sentence to be recognized to obtain a word sequence after word segmentation;
and inputting the word sequence into a pre-trained, word-sequence-based NER model and outputting a recognition result.
On the other hand, an embodiment of the present invention further provides a named entity recognition system, including:
the word segmentation device provided by the embodiment of the invention, used for segmenting sentences to obtain word sequences; the sentences comprise the sentences to be recognized as well as the corpora and labeled samples used in model training;
the model training module, used for training a word-sequence-based NER model;
and the recognition module, used for inputting the word sequence obtained by segmenting the sentence to be recognized into the word-sequence-based NER model and outputting a recognition result.
In another aspect, an embodiment of the present invention also provides an electronic device, including: a memory storing program instructions; and the processor is connected with the memory and executes the program instructions in the memory to realize the steps of the method in the embodiment of the invention.
Compared with the prior art, the invention provides a new approach to word segmentation that keeps segmentation as accurate as possible: wherever segmentation is unambiguous, the text is segmented into words as far as possible so as to shorten the input sequence, whereas the parts where segmentation is ambiguous, especially unknown words, are broken up and represented directly as a character sequence, so that segmentation errors do not affect the subsequent NER task. For a large language model with many parameters such as BERT, converting the input character sequence into a word sequence reduces video memory usage during computation. The method and device can therefore improve both the efficiency and the accuracy of named entity recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a flowchart of the word segmentation method described in embodiment 1.
Fig. 2 is a block diagram showing the components of the word segmentation system described in embodiment 1.
Fig. 3 is a flowchart of a named entity recognition method in embodiment 2.
Fig. 4 is a flowchart of the training of the word-sequence-based NER model in embodiment 2.
Fig. 5 is a comparison graph of video memory occupation under different parameters for the two methods.
Fig. 6 is a graph comparing input matrix sizes for two methods.
Fig. 7 is a schematic block diagram of the named entity recognition system described in embodiment 2.
Fig. 8 is a block diagram showing the components of the electronic apparatus described in the embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Embodiment 1
As shown in fig. 1, the present embodiment schematically provides a word segmentation method, including the following steps:
Step 1 (S1 in the figure): a dictionary is constructed.
The word segmentation method provided by this embodiment is particularly suitable for segmenting Chinese sentences, so a Chinese word segmenter is constructed here. jieba is a commonly used Chinese word segmentation tool; its dictionary is used directly as the dictionary of the word segmenter, with some rarely used words deleted and the correct, frequently used words retained as far as possible, so as to reduce the size of the word segmenter. As a simplification, the jieba dictionary can of course also be used as the dictionary of the word segmenter without any processing.
Step 2: based on the dictionary constructed in step 1, a prefix tree (trie) is generated for the sentence to be segmented to enable efficient word graph scanning, and a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in the sentence is generated.
While the trie is generated, the number of occurrences of each word in the dictionary is converted into a frequency. Word graph scanning over the trie structure works by placing the words of the dictionary into a single trie: words whose first few characters are the same share a prefix and can therefore be stored along the same branch, which gives the trie structure its fast lookup speed. In other words, each sentence of the text to be segmented (which consists of one or more sentences) is looked up against the given dictionary, and all possible segmentations of the sentence are generated. Prefix tree generation and word graph scanning both use existing techniques, so the process is not described in detail here.
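The dictionary pruning described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes dictionary entries in the space-separated "word count tag" layout used by jieba's dictionary file, and the function name `load_dictionary`, the threshold `min_freq`, and the sample counts are all hypothetical.

```python
from typing import Dict, Iterable


def load_dictionary(lines: Iterable[str], min_freq: int = 1) -> Dict[str, int]:
    """Parse 'word count [tag]' lines, keeping words seen at least min_freq times."""
    dictionary: Dict[str, int] = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, count = parts[0], int(parts[1])
        if count >= min_freq:
            dictionary[word] = count
    return dictionary


# Toy entries; the counts are invented for illustration only.
SAMPLE_LINES = [
    "北京 34488 ns",
    "大学 20025 n",
    "北京大学 2053 nt",
    "罕见词 2 n",
]

# Drop words that occur fewer than 10 times (threshold chosen arbitrarily).
pruned = load_dictionary(SAMPLE_LINES, min_freq=10)
print(sorted(pruned))
```

Calling `load_dictionary` with `min_freq=1` keeps every entry, which corresponds to using the dictionary without any processing.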
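A prefix tree of this kind can be sketched in a few lines. The class and method names below (`Trie`, `insert`, `frequency`, `words_from`) and the toy counts are assumptions made for illustration; the source does not specify a data layout.

```python
from typing import Dict, List, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.count: Optional[int] = None  # set only where a dictionary word ends


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()
        self.total = 0  # sum of all word counts, used to turn counts into frequencies

    def insert(self, word: str, count: int) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count = count
        self.total += count

    def frequency(self, word: str) -> float:
        """Relative frequency of a word, or 0.0 if it is not in the trie."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0.0
        if node.count is None or self.total == 0:
            return 0.0
        return node.count / self.total

    def words_from(self, sentence: str, start: int) -> List[str]:
        """Return every dictionary word that begins at sentence[start]."""
        found: List[str] = []
        node = self.root
        for end in range(start, len(sentence)):
            node = node.children.get(sentence[end])
            if node is None:
                break  # no dictionary word continues with this character
            if node.count is not None:
                found.append(sentence[start:end + 1])
        return found


trie = Trie()
for word, count in [("北京", 50), ("大学", 40), ("北京大学", 30)]:
    trie.insert(word, count)

print(trie.words_from("去北京大学玩", 1))
print(trie.frequency("北京大学"))
```

Because "北京" and "北京大学" share a prefix, a single walk down one branch finds both candidates.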
Taking the sentence "he has won back to the chief commander trust" as an example, the constructed DAG is: {0: [0], 1: [1], 2: [2, 3], 3: [3], 4: [4, 5], 5: [5], 6: [6, 7], 7: [7]}, where each key is the position of a character in the sentence and each value lists the end positions at which a word starting at that key can be formed. A directed acyclic graph of the sentence is shown in FIG. 2.
Step 3: dynamic programming is used to search for the maximum-probability path and find the maximum-probability segmentation combination based on word frequency. Once the directed acyclic graph exists, the best overall segmentation must be found. For example, only one of the overlapping candidates "main" and "chief" can be selected, and which one is chosen has to be decided from the sentence as a whole so that a globally optimal scheme is found; this is done with a dynamic programming algorithm.
Specifically, the candidate words in the sentence to be segmented are first looked up to obtain their frequencies; if a word does not exist in the dictionary, the frequency of the least frequent word in the dictionary is used as its frequency. Then, following the dynamic programming method for finding the maximum-probability path, the maximum probability is computed in reverse, from right to left, because the main stem of a sentence usually comes at the end, preceded by a large number of modifiers such as adjectives. Computing from right to left is therefore more accurate than computing from left to right, similar to reverse maximum matching: P(node N) = 1.0, P(node N-1) = P(node N) × max(P(last word)), and so on in sequence, until the maximum-probability path, and with it the maximum-probability segmentation combination, is obtained.
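The key/value layout of such a DAG can be reproduced with a short sketch. The sentence and dictionary below are a hypothetical toy example, not the sentence used above, and `build_dag` is an illustrative name. As a simplifying assumption, every position always keeps a single-character edge, which matches the shape of the DAG shown above.

```python
from typing import Dict, List, Set


def build_dag(sentence: str, dictionary: Set[str]) -> Dict[int, List[int]]:
    """Map each start position to the end positions where a word can be formed."""
    max_len = max((len(word) for word in dictionary), default=1)
    dag: Dict[int, List[int]] = {}
    for start in range(len(sentence)):
        ends = [start]  # a single character is always a valid fallback
        last = min(len(sentence), start + max_len)
        for end in range(start + 1, last):
            if sentence[start:end + 1] in dictionary:
                ends.append(end)
        dag[start] = ends
    return dag


toy_dictionary = {"北京", "大学", "北京大学"}
dag = build_dag("去北京大学玩", toy_dictionary)
print(dag)
# {0: [0], 1: [1, 2, 4], 2: [2], 3: [3, 4], 4: [4], 5: [5]}
```

Position 1 has three outgoing edges because the character at that position can stand alone, start the two-character word, or start the four-character word.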
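The right-to-left computation can be sketched as below. This is an illustration under stated assumptions: log-probabilities replace the product of probabilities to avoid underflow (a common implementation choice, not something the text prescribes), the DAG is written out by hand for a toy sentence, and the counts in `freq` are invented.

```python
import math
from typing import Dict, List, Tuple


def best_route(sentence: str, dag: Dict[int, List[int]],
               freq: Dict[str, int]) -> Dict[int, Tuple[float, int]]:
    """For each start position, store (best log-probability to the end, chosen end)."""
    log_total = math.log(sum(freq.values()))
    min_freq = min(freq.values())  # fallback for words missing from the dictionary
    n = len(sentence)
    route: Dict[int, Tuple[float, int]] = {n: (0.0, 0)}  # log(1.0) at the end node
    for start in range(n - 1, -1, -1):  # right to left
        route[start] = max(
            (math.log(freq.get(sentence[start:end + 1], min_freq)) - log_total
             + route[end + 1][0], end)
            for end in dag[start]
        )
    return route


def cut(sentence: str, route: Dict[int, Tuple[float, int]]) -> List[str]:
    """Follow the chosen end positions from left to right to emit the words."""
    words, i = [], 0
    while i < len(sentence):
        end = route[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words


sentence = "去北京大学玩"
dag = {0: [0], 1: [1, 2, 4], 2: [2], 3: [3, 4], 4: [4], 5: [5]}
freq = {"去": 100, "北京": 50, "大学": 40, "北京大学": 30, "玩": 20}

print(cut(sentence, best_route(sentence, dag, freq)))
# ['去', '北京大学', '玩']
```

With these toy counts the four-character word wins because one moderately frequent word scores higher than the product of two words.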
Step 4: for unknown words in the sentence to be segmented that do not exist in the dictionary, no further word-level segmentation is attempted; they are broken up directly and split character by character, i.e. each unknown word is split into individual characters.
In this word segmentation method, the first three steps perform segmentation with an existing method such as jieba, and only the unknown words are handled separately by splitting them into characters. This prevents an unknown word from being recombined with the preceding or following words, which would make the segmentation inaccurate.
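The character-level fallback for unknown words can be sketched as a post-processing step over a token list. The function name `split_unknown`, the toy dictionary and the made-up company name are hypothetical; the sketch assumes the tokens come from some upstream segmenter that may have emitted out-of-dictionary words.

```python
from typing import Iterable, List, Set


def split_unknown(tokens: Iterable[str], dictionary: Set[str]) -> List[str]:
    """Keep dictionary words; break every other token into single characters."""
    result: List[str] = []
    for token in tokens:
        if token in dictionary or len(token) == 1:
            result.append(token)
        else:
            result.extend(token)  # iterating a str yields its characters
    return result


toy_dictionary = {"是", "餐饮", "公司"}
# "阿尔法贝" stands in for a company name that is not in the dictionary.
tokens = ["阿尔法贝", "是", "餐饮", "公司"]

print(split_unknown(tokens, toy_dictionary))
# ['阿', '尔', '法', '贝', '是', '餐饮', '公司']
```

The known words stay intact, so the sequence remains shorter than a pure character sequence while the unknown name is left for the downstream model to assemble.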
Test examples
Taking the sentence "xiang oen is a catering company" as an example, where "xiang oen" is an unknown word, the segmentation result obtained with the method of this embodiment is: xiang | Eve | love | is | home | restaurant | corporation; the result of the word segmentation method based on string matching (jieba) is: xiang Ei | is | Yi Jia | catering company.
The comparison shows that the traditional word segmentation method based on string matching splits the named entity (or entity for short) "xiang oen" into two parts, and the split parts can combine with the characters before and after them to form new words, so the named entity cannot be recognized correctly during named entity recognition. The word segmenter used in this embodiment (the improved word segmentation method) does not recombine the entity with other words or characters into a new word. The reason is that entities are generally nouns with a specific meaning, such as names of persons, places and organizations, which a dictionary can hardly cover, and the general strategy of other word segmenters in the prior art is to segment unknown words as far as possible according to their respective algorithms, so some wrong segmentations are inevitable.
Referring to fig. 2, based on the same inventive concept, a word segmentation system, which can also be understood as a word segmenter, is also provided in the present embodiment. The word segmentation device comprises a dictionary construction module, a directed acyclic graph generation module, a segmentation combination module and a character segmentation module.
In particular, the dictionary construction module is used to construct a dictionary.
The directed acyclic graph generation module is used for generating a prefix tree for the sentence to be segmented based on the prefix dictionary, realizing efficient word graph scanning and generating a Directed Acyclic Graph (DAG) formed by all possible word forming conditions of the Chinese characters in the sentence.
The segmentation combination module is used for searching for the maximum-probability path by dynamic programming and finding the maximum-probability segmentation combination based on word frequency.
The character segmentation module is used for splitting unknown words in the sentence to be segmented that do not exist in the dictionary character by character into individual characters.
For parts not described in the present system, reference is made to the corresponding description in the foregoing method embodiments, and details are not repeated here.
Embodiment 2
Referring to fig. 3, the present embodiment provides a method for identifying a named entity, which utilizes the word segmentation method described in embodiment 1. Specifically, the named entity identification method comprises the following steps:
Step 10: according to the method described in embodiment 1, the sentence to be recognized is segmented to obtain the words that make up the sentence; these words form a word sequence. Still taking the sentence "xiang oen is a catering company" as an example, where "xiang oen" is an unknown word, the result obtained with the method described in embodiment 1 is: xiang | Eve | is | Yi | family | restaurant | Co.
Step 20: the word sequence obtained after segmentation is input into a pre-trained, word-sequence-based NER model, which outputs a recognition result, i.e. the named entities in the sentence to be recognized.
Referring to fig. 4, the word-sequence-based NER (Named Entity Recognition) model is obtained by training through the following steps:
Step 11, collecting Chinese corpora: a crawler is used to capture as much Chinese corpus as possible, covering as many different fields as possible. The experiment processes Chinese sentences, so Chinese corpora are collected; if the method is applied to named entity recognition for sentences in another language, corpora in that language are collected instead.
Step 12: the collected and sorted Chinese corpus is segmented using the method described in embodiment 1 to obtain a large-scale segmented Chinese corpus.
Step 13: the open-source BERT model training code is used to train a language model on the segmented Chinese corpus, yielding a word-based Chinese BERT pre-trained model.
Step 14: a small number of labeled samples required for entity recognition are annotated manually. The samples to be labeled are selected according to the specific task: if the task concerns entertainment news (i.e. the model is used to identify named entities in entertainment news), entertainment news corpora are labeled; if it concerns court announcements, court announcement data is labeled. Before a sample is labeled, it is first segmented with the word segmenter of embodiment 1, and the entities are then labeled.
Step 15: the word-based BERT pre-trained model is fine-tuned to train the NER model. The labeled samples are divided into a training set and a test set; the training set is used to train the model and the test set is used to verify whether the model meets the requirements. If it does, training ends; if not, the model parameters are adjusted or the number of labeled samples is increased until the error meets the requirements or the number of iterations reaches a threshold.
The named entity recognition method of this embodiment uses a word-sequence-based NER model whose input is a word sequence rather than the character sequence used as input by a conventional BERT model, so recognition efficiency can be improved and video memory usage reduced.
Referring to FIG. 5, FIG. 5 compares named entity recognition based on character sequences with named entity recognition based on word sequences. For the comparative test, a batch of public opinion data and announcement data such as official documents was manually labeled, mainly covering three entity types: person names (person), company names (company) and organization names (organization). The public opinion data is divided into a training set (train_yuqing) and a test set (test_yuqing), and the announcement data serves as a test set (test_gonggao). As can be seen from FIG. 5, when the model parameters are the same (number of hidden layers: 12; hidden layer size: 768; number of attention heads: 12), the NER model based on the word sequence outperforms the BERT model based on the character sequence regardless of whether the public opinion data or the announcement data is used as the test set; it also generalizes better and places lower demands on the data coverage of the training set. A deep model with more parameters has stronger recognition capability but also occupies more resources, so the experiment also tried reducing the parameters of the NER model based on the word sequence and observing the recognition effect. With the number of hidden layers set to 4, the hidden layer size set to 512 (bytes) and the number of attention heads set to 8, FIG. 5 shows that the recognition accuracy of the NER model based on the word sequence is still better than that of the BERT model based on the character sequence.
Finally, video memory usage during model training was measured (for all three models the batch size is 64; the character-sequence length is 64 and the word-sequence length is 40). As shown under "memory occupation" in FIG. 5, when the model parameters are the same, the NER model based on the word sequence uses 66.7% of the video memory of the character-sequence model; when the accuracy is the same, i.e. with the parameters of the NER model based on the word sequence reduced, it uses only 16.7% of the video memory of the character-sequence model. Fig. 6 gives an example of the input layer: when an NLP task is performed, the characters or words in the input sequence are first converted into their corresponding vectors. Suppose the input sentence is "'Xiang Huo Qing' is a catering company." and the vector length is 728. Without word segmentation, embedding the character sequence directly gives an input matrix of size [11 × 728]; if the sentence is first segmented with the word segmenter and the word sequence is used as input, the input matrix after embedding has size [9 × 728]. With identical model parameters, the video memory occupied at the input layer when using the character sequence is therefore about 1.23 times that when using the word sequence. As FIG. 5 and FIG. 6 show, the named entity recognition method of this embodiment can improve recognition efficiency, reduce video memory usage, and improve the accuracy of the recognition result.
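The input-layer comparison is simple arithmetic and can be checked with a short sketch. The helper name `input_matrix_shape` is hypothetical; the sequence lengths (11 and 9) and the vector length (728) are the figures quoted above, and the sketch assumes the input-layer footprint is proportional to the number of matrix elements.

```python
from typing import Tuple


def input_matrix_shape(sequence_length: int, vector_length: int) -> Tuple[int, int]:
    """Shape of the embedded input: one row per token, one column per dimension."""
    return (sequence_length, vector_length)


char_shape = input_matrix_shape(11, 728)  # character sequence
word_shape = input_matrix_shape(9, 728)   # word sequence after segmentation

char_elements = char_shape[0] * char_shape[1]
word_elements = word_shape[0] * word_shape[1]
ratio = char_elements / word_elements

print(char_shape, word_shape)
print(round(ratio, 2))  # 1.22
```

The ratio is 11/9, roughly 1.22, which is in line with the "about 1.23 times" figure quoted above; the vector length cancels out, so only the sequence lengths matter at this layer.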
Referring to fig. 7, this embodiment also provides a named entity recognition system, which includes the word segmenter described in embodiment 1, a model training module, and a recognition module.
Specifically, the word segmenter is used for segmenting sentences to obtain word sequences; the sentences comprise the sentences to be recognized as well as the corpora and labeled samples required for training the word-sequence-based NER model. After segmentation by the word segmenter, unregistered names are split into single characters and the remaining text of the sentence is segmented into words.
The model training module is used for training the word-sequence-based NER model. Specifically, the model training module uses the open-source BERT model training code to train a language model on the corpus segmented by the word segmenter, obtaining a word-based BERT pre-trained model; then, based on manually labeled named entity samples, it fine-tunes the word-based BERT pre-trained model to train the NER model.
The recognition module is used for inputting the word sequence obtained by segmenting the sentence to be recognized into the word-sequence-based NER model and outputting a recognition result.
The named entity recognition system is based on the same inventive concept of the recognition method, and for unclear parts, reference can be made to the related description in the introduction of the method.
On the one hand, whereas the traditional BERT-based named entity recognition method performs recognition on character sequences, the present method first segments the sentence into words and performs recognition on the word sequence, which can greatly improve recognition efficiency and reduce video memory usage, as shown in fig. 5 and fig. 6. On the other hand, segmentation uses the method described in embodiment 1, in which an unregistered name is split into single characters instead of words. This prevents the split parts from being recombined with other words in the surrounding text and thus prevents the unregistered name from being recognized incorrectly; since unregistered names are generally named entities, the word segmentation method improves the accuracy of named entity recognition.
As shown in fig. 8, the present embodiment also provides an electronic device, which may include a processor 51 and a memory 52, wherein the memory 52 is coupled to the processor 51. It is noted that this diagram is exemplary and that other types of structures may be used in addition to or in place of this structure to implement data extraction, report generation, communication, or other functionality.
As shown in fig. 8, the electronic device may further include: an input unit 53, a display unit 54, and a power supply 55. It is noted that the electronic device does not necessarily have to include all of the components shown in fig. 8. Furthermore, the electronic device may also comprise components not shown in fig. 8, reference being made to the prior art.
The processor 51, also sometimes referred to as a controller or operational control, may comprise a microprocessor or other processor device and/or logic device, the processor 51 receiving input and controlling operation of the various components of the electronic device.
The memory 52 may be one or more of a buffer, a flash memory, a hard drive, a removable medium, a volatile memory, a non-volatile memory, or other suitable devices, and may store the configuration information of the processor 51, the instructions executed by the processor 51, and other information. The processor 51 may execute a program stored in the memory 52 to realize information storage or processing, or the like. In one embodiment, a buffer memory, i.e., a buffer, is also included in the memory 52 to store the intermediate information.
The input unit 53 is used, for example, to supply statement data to the processor 51. The display unit 54 is used for displaying various results in the process, such as input sentence data, model output results, etc., and may be, for example, an LCD display, but the present invention is not limited thereto. The power supply 55 is used to provide power to the electronic device.
Embodiments of the present invention further provide computer-readable instructions which, when executed in an electronic device, cause the electronic device to execute the operation steps included in the method of the present invention.
Embodiments of the present invention further provide a storage medium storing computer-readable instructions, where the computer-readable instructions cause an electronic device to execute the operation steps included in the method of the present invention.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that the various illustrative modules described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed system may be implemented in other ways. The system embodiments described above are merely illustrative: the division into modules is only a logical division, and other divisions are possible in actual implementation; for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not implemented.
The above description covers only specific embodiments of the present invention, but the scope of the present invention is not limited thereto; any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A method of word segmentation, comprising the steps of:
constructing a dictionary;
generating a prefix tree for the sentence to be segmented based on the dictionary, and performing word graph scanning to generate a directed acyclic graph of all possible word formations;
searching for the maximum-probability path by dynamic programming to find the maximum segmentation combination based on word frequency;
and, for unknown words in the sentence to be segmented that do not exist in the dictionary, splitting each unknown word character by character into a plurality of single characters.
2. A word segmentation device, characterized by comprising a dictionary construction module, a directed acyclic graph generating module, a segmentation combination module and a character segmentation module; wherein
the dictionary construction module is used for constructing a dictionary;
the directed acyclic graph generating module is used for generating a prefix tree for the sentence to be segmented based on the dictionary and performing word graph scanning to generate a directed acyclic graph of all possible word formations;
the segmentation combination module is used for searching for the maximum-probability path by dynamic programming to find the maximum segmentation combination based on word frequency;
the character segmentation module is used for splitting unknown words in the sentence to be segmented that do not exist in the dictionary, character by character, into a plurality of single characters.
3. A named entity recognition method, characterized by comprising the following steps:
segmenting the sentence to be recognized by the word segmentation method according to claim 1 to obtain a segmented word sequence;
and inputting the word sequence into a pre-trained word-sequence-based NER model, and outputting a recognition result.
4. The method of claim 3, wherein the word sequence based NER model is trained by:
preparing a corpus;
segmenting the prepared corpus by the word segmentation method according to claim 1 to obtain a segmented corpus;
performing language model training on the segmented corpus using the open-source BERT model training code to obtain a word-based BERT pre-trained model;
preparing labeled samples in a targeted manner, segmenting the labeled samples by the word segmentation method of claim 1, and labeling the named entities;
and, based on the labeled samples, fine-tuning the word-based BERT pre-trained model to train the NER model.
5. A named entity recognition system, comprising:
the word segmentation device of claim 2, configured to segment a sentence to obtain a segmented word sequence, wherein the sentences comprise sentences to be recognized, corpora used in model training, and labeled samples;
the model training module is used for training to obtain an NER model based on a word sequence;
and the recognition module is used for inputting the word sequence obtained by segmenting the sentence to be recognized into the NER model based on the word sequence and outputting to obtain a recognition result.
6. The system of claim 5, wherein the model training module is specifically configured to:
performing language model training on a corpus segmented by the word segmentation device, using the open-source BERT model training code, to obtain a word-based BERT pre-trained model;
and, based on labeled samples that have been segmented and manually labeled with named entities, fine-tuning the word-based BERT pre-trained model to train the NER model.
7. A computer readable storage medium comprising computer readable instructions which, when executed, cause a processor to perform the operations of the method of claim 1 or cause a processor to perform the operations of the method of any of claims 3-4.
8. An electronic device, comprising:
a memory storing program instructions;
a processor coupled to the memory, executing program instructions in the memory, implementing the steps of the method of claim 1, or implementing the steps of the method of any of claims 3-4.
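
The word segmentation steps recited in claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the applicant's implementation: the prefix tree is approximated by a flat set of word prefixes, the probability of a path is taken as the sum of log relative word frequencies, characters absent from the dictionary are scored as if they had frequency 1, and all names (EXAMPLE_FREQ, build_prefix_set, build_dag, best_route, segment) and the toy frequencies are hypothetical.

```python
import math

# Hypothetical toy dictionary: word -> frequency (assumption, for illustration only).
EXAMPLE_FREQ = {"我": 50, "来到": 30, "北京": 40, "清华大学": 20,
                "清华": 10, "大学": 25, "华大": 2}


def build_prefix_set(freq):
    """Collect every prefix of every dictionary word (flat stand-in for a prefix tree)."""
    prefixes = set()
    for word in freq:
        for end in range(1, len(word) + 1):
            prefixes.add(word[:end])
    return prefixes


def build_dag(sentence, freq, prefixes):
    """Map each start index to the inclusive end indices of candidate words."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = []
        end = start
        # Extend the candidate while it is still a prefix of some dictionary word.
        while end < n and sentence[start:end + 1] in prefixes:
            if sentence[start:end + 1] in freq:
                ends.append(end)
            end += 1
        if not ends:
            # No dictionary word starts here: keep the single character on its own.
            ends.append(start)
        dag[start] = ends
    return dag


def best_route(sentence, dag, freq):
    """Right-to-left dynamic programming over the DAG for the best log-probability path."""
    log_total = math.log(sum(freq.values()))
    n = len(sentence)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(freq.get(sentence[start:end + 1], 0) or 1) - log_total
             + route[end + 1][0], end)
            for end in dag[start]
        )
    return route


def segment(sentence, freq):
    """Segment a sentence; out-of-dictionary text is emitted one character at a time."""
    prefixes = build_prefix_set(freq)
    dag = build_dag(sentence, freq, prefixes)
    route = best_route(sentence, dag, freq)
    words, start = [], 0
    while start < len(sentence):
        end = route[start][1]
        words.append(sentence[start:end + 1])
        start = end + 1
    return words


print(segment("张三丰来到清华大学", EXAMPLE_FREQ))
# → ['张', '三', '丰', '来到', '清华大学']
```

With this toy dictionary, the three characters of the unregistered name 张三丰 are emitted as separate tokens and are not merged with the neighbouring dictionary words, which matches the character-level handling of unknown words described in the claims.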
CN201910978522.1A 2019-10-15 2019-10-15 Word segmentation method, word segmentation device, named entity identification method and system Pending CN110750993A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910978522.1A CN110750993A (en) 2019-10-15 2019-10-15 Word segmentation method, word segmentation device, named entity identification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910978522.1A CN110750993A (en) 2019-10-15 2019-10-15 Word segmentation method, word segmentation device, named entity identification method and system

Publications (1)

Publication Number Publication Date
CN110750993A true CN110750993A (en) 2020-02-04

Family

ID=69278418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910978522.1A Pending CN110750993A (en) 2019-10-15 2019-10-15 Word segmentation method, word segmentation device, named entity identification method and system

Country Status (1)

Country Link
CN (1) CN110750993A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111598417A (en) * 2020-04-29 2020-08-28 国网江苏省电力有限公司南京供电分公司 Natural language processing-based electric power non-emergency repair order distribution method
CN111597309A (en) * 2020-05-25 2020-08-28 深圳市小满科技有限公司 Similar enterprise recommendation method and device, electronic equipment and medium
CN111797247A (en) * 2020-09-10 2020-10-20 平安国际智慧城市科技股份有限公司 Case pushing method and device based on artificial intelligence, electronic equipment and medium
CN112307759A (en) * 2020-11-09 2021-02-02 西安交通大学 Cantonese word segmentation method for irregular short text of social network
CN112765963A (en) * 2020-12-31 2021-05-07 北京锐安科技有限公司 Sentence segmentation method and device, computer equipment and storage medium
CN112948536A (en) * 2020-11-09 2021-06-11 袭明科技(广东)有限公司 Information extraction method and device for web resume page
CN113268988A (en) * 2021-07-19 2021-08-17 中国平安人寿保险股份有限公司 Text entity analysis method and device, terminal equipment and storage medium
WO2021174919A1 (en) * 2020-03-06 2021-09-10 平安科技(深圳)有限公司 Method and apparatus for analysis and matching of resume data information, electronic device, and medium
CN113505197A (en) * 2021-07-07 2021-10-15 西安康奈网络科技有限公司 Method for judging high-frequency words in single public opinion event comment
WO2023005293A1 (en) * 2021-07-30 2023-02-02 平安科技(深圳)有限公司 Text error correction method, apparatus, and device, and storage medium
CN115759087A (en) * 2022-11-25 2023-03-07 成都赛力斯科技有限公司 Chinese word segmentation method and device and electronic equipment
CN116579344A (en) * 2023-07-12 2023-08-11 吉奥时空信息技术股份有限公司 Case main body extraction method

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942190A (en) * 2014-04-16 2014-07-23 安徽科大讯飞信息科技股份有限公司 Text word-segmentation method and system
CN107329951A (en) * 2017-06-14 2017-11-07 深圳市牛鼎丰科技有限公司 Build name entity mark resources bank method, device, storage medium and computer equipment
CN108090039A (en) * 2016-11-21 2018-05-29 中移(苏州)软件技术有限公司 A kind of name recognition methods and device
CN108109624A (en) * 2016-11-23 2018-06-01 中国科学院声学研究所 A kind of method of estimation of the Chinese vocabulary table unregistered word ratio based on statistical law
CN109145303A (en) * 2018-09-06 2019-01-04 腾讯科技(深圳)有限公司 Name entity recognition method, device, medium and equipment
CN109378053A (en) * 2018-11-30 2019-02-22 安徽影联云享医疗科技有限公司 A kind of knowledge mapping construction method for medical image
CN109710087A (en) * 2018-12-28 2019-05-03 北京金山安全软件有限公司 Input method model generation method and device
CN109871541A (en) * 2019-03-06 2019-06-11 电子科技大学 It is a kind of suitable for multilingual multi-field name entity recognition method
CN109992788A (en) * 2019-04-10 2019-07-09 北京神州泰岳软件股份有限公司 Depth text matching technique and device based on unregistered word processing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
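秦文,苑春法: "基于决策树的汉语未登录词识别" [Recognition of Chinese unknown words based on decision trees] *
陈小荷: "自动分词中未登录词问题的一揽子解决方案" [A package solution to the unknown-word problem in automatic word segmentation] *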

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021174919A1 (en) * 2020-03-06 2021-09-10 平安科技(深圳)有限公司 Method and apparatus for analysis and matching of resume data information, electronic device, and medium
CN111598417B (en) * 2020-04-29 2022-10-04 国网江苏省电力有限公司南京供电分公司 Natural language processing-based electric power non-emergency repair order distribution method
CN111598417A (en) * 2020-04-29 2020-08-28 国网江苏省电力有限公司南京供电分公司 Natural language processing-based electric power non-emergency repair order distribution method
CN111597309A (en) * 2020-05-25 2020-08-28 深圳市小满科技有限公司 Similar enterprise recommendation method and device, electronic equipment and medium
CN111797247A (en) * 2020-09-10 2020-10-20 平安国际智慧城市科技股份有限公司 Case pushing method and device based on artificial intelligence, electronic equipment and medium
CN112307759A (en) * 2020-11-09 2021-02-02 西安交通大学 Cantonese word segmentation method for irregular short text of social network
CN112948536A (en) * 2020-11-09 2021-06-11 袭明科技(广东)有限公司 Information extraction method and device for web resume page
CN112307759B (en) * 2020-11-09 2024-04-12 西安交通大学 Yue language word segmentation method for irregular short text of social network
CN112765963A (en) * 2020-12-31 2021-05-07 北京锐安科技有限公司 Sentence segmentation method and device, computer equipment and storage medium
CN113505197A (en) * 2021-07-07 2021-10-15 西安康奈网络科技有限公司 Method for judging high-frequency words in single public opinion event comment
CN113268988A (en) * 2021-07-19 2021-08-17 中国平安人寿保险股份有限公司 Text entity analysis method and device, terminal equipment and storage medium
WO2023005293A1 (en) * 2021-07-30 2023-02-02 平安科技(深圳)有限公司 Text error correction method, apparatus, and device, and storage medium
CN115759087A (en) * 2022-11-25 2023-03-07 成都赛力斯科技有限公司 Chinese word segmentation method and device and electronic equipment
CN115759087B (en) * 2022-11-25 2024-02-20 重庆赛力斯凤凰智创科技有限公司 Chinese word segmentation method and device and electronic equipment
CN116579344A (en) * 2023-07-12 2023-08-11 吉奥时空信息技术股份有限公司 Case main body extraction method
CN116579344B (en) * 2023-07-12 2023-10-20 吉奥时空信息技术股份有限公司 Case main body extraction method

Similar Documents

Publication Publication Date Title
CN110750993A (en) Word segmentation method, word segmentation device, named entity identification method and system
US7035789B2 (en) Supervised automatic text generation based on word classes for language modeling
Creutz et al. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0
CN112906392B (en) Text enhancement method, text classification method and related device
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
US11031009B2 (en) Method for creating a knowledge base of components and their problems from short text utterances
KR20110038474A (en) Apparatus and method for detecting sentence boundaries
US20140032207A1 (en) Information Classification Based on Product Recognition
JP2007087397A (en) Morphological analysis program, correction program, morphological analyzer, correcting device, morphological analysis method, and correcting method
CN115495555A (en) Document retrieval method and system based on deep learning
EP3598321A1 (en) Method for parsing natural language text with constituent construction links
CN113255331B (en) Text error correction method, device and storage medium
CN110751234A (en) OCR recognition error correction method, device and equipment
CN112711944B (en) Word segmentation method and system, and word segmentation device generation method and system
CN112765976A (en) Text similarity calculation method, device and equipment and storage medium
CN115858776B (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN113553847A (en) Method, device, system and storage medium for parsing address text
CN110874408B (en) Model training method, text recognition device and computing equipment
CN111916063A (en) Sequencing method, training method, system and storage medium based on BPE (Business Process Engineer) coding
CN109002454B (en) Method and electronic equipment for determining spelling partition of target word
CN114626529B (en) Natural language reasoning fine tuning method, system, device and storage medium
CN111626059B (en) Information processing method and device
CN114625860A (en) Contract clause identification method, device, equipment and medium
CN114298048A (en) Named entity identification method and device
JP2007322984A (en) Model learning method, information extracting method, model learning device, information extracting device, model learning program, information extracting program, and recording medium where those programs are recorded

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200204