CN113505828A - Word segmentation method, device and equipment for multi-source information fusion - Google Patents

Word segmentation method, device and equipment for multi-source information fusion Download PDF

Info

Publication number
CN113505828A
CN113505828A (application CN202110776250.4A)
Authority
CN
China
Prior art keywords
word segmentation
word
information
vector
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110776250.4A
Other languages
Chinese (zh)
Inventor
顾敏 (Gu Min)
杜向阳 (Du Xiangyang)
徐芳 (Xu Fang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Aijuesi Information Technology Co ltd
Original Assignee
Shanghai Aijuesi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Aijuesi Information Technology Co ltd filed Critical Shanghai Aijuesi Information Technology Co ltd
Priority to CN202110776250.4A priority Critical patent/CN113505828A/en
Publication of CN113505828A publication Critical patent/CN113505828A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The application discloses a word segmentation method, device and equipment for multi-source information fusion. The word segmentation method comprises the following steps: generating a unary information feature vector, a binary information feature vector and a dependency syntax information feature vector of the sentence to be recognized; fusing the unary information feature vector, the binary information feature vector and the dependency syntax information feature vector in the fusion layer of the multi-source fusion model; and outputting the word segmentation result of the sentence to be recognized at the output layer of the multi-source fusion model. By fusing unary information, binary information and dependency syntax information, the method effectively learns context information and external resource information, produces word segmentation results, and improves the accuracy and recall rate of word segmentation.

Description

Word segmentation method, device and equipment for multi-source information fusion
Technical Field
The application relates to the technical field of word segmentation processing, in particular to a word segmentation method, device and equipment for multi-source information fusion.
Background
In the prior art, word segmentation methods are mainly oriented to general data sets and exploit a single type of feature: generally only unary (single-character) or binary (character-bigram) information is used, which cannot meet the requirements of specific scenarios, such as legal scenarios. The legal field contains a large number of legal entities that a general word segmentation model cannot segment accurately. For example, in "Criminal Law of the People's Republic of China", "the People's Republic of China" and "Criminal Law" stand in a dependency relationship, an attributive-head relation, indicating that the two words are a modifier and a head respectively and, in the legal field, should be joined into a single phrase. An existing word segmentation method, however, would mistakenly split it into the two words "the People's Republic of China" and "Criminal Law". Because existing methods do not consider dependency syntax information, their segmentation is often not accurate enough, and the resulting segmentation hinders legal reading and gives a poor experience.
Disclosure of Invention
The present application mainly aims to provide a word segmentation method, device and apparatus for multi-source information fusion, so as to solve the above problems.
In order to achieve the above object, according to one aspect of the present application, a word segmentation method for multi-source information fusion is provided, including:
generating a unary information feature vector, a binary information feature vector and a dependency syntax information feature vector of the sentence to be recognized;
fusing the unary information feature vector, the binary information feature vector and the dependency syntax information feature vector in the fusion layer of the multi-source fusion model;
and outputting the word segmentation result of the sentence to be recognized at the output layer of the multi-source fusion model.
Further, generating the unary information feature vector of the sentence to be recognized includes:
generating a unary character sequence of the sentence to be recognized;
setting a word segmentation granularity tag for the input sentence;
and, according to the word segmentation granularity tag, encoding the sentence to be recognized with a BERT model to obtain the unary information vector of the sentence to be recognized.
Further, generating a binary information feature vector of the sentence to be recognized includes:
generating a binary character sequence of the sentence to be recognized;
and querying a static word vector table to obtain a binary information vector of the binary character sequence.
Further, generating the dependency syntax information feature vector of the sentence to be recognized includes:
for each character, acquiring a context feature set and a syntactic feature set of the character;
encoding the context feature set and the syntactic feature set to obtain a context feature vector and a syntactic feature vector;
obtaining the dependency syntax information feature vector of the character from the syntactic feature output vector and the context feature output vector;
and summing the dependency syntax information feature vectors of all characters to obtain the dependency syntax information feature vector of the sentence to be recognized.
Further, after the output layer of the multi-source fusion model outputs the word segmentation result of the sentence to be recognized, the method further comprises the following steps:
and correcting the word segmentation result by adopting a preset user-defined word list.
Further, correcting the word segmentation result with a preset user-defined vocabulary comprises the following steps:
judging whether the user-defined vocabulary contains the word segmentation result; if so, judging whether the user-defined vocabulary contains a subset of the word segmentation result, or a longer phrase containing the word segmentation result;
if so, reading the subset of the word segmentation result, or the longer phrase containing it, from the user-defined vocabulary as a candidate set;
and replacing the related words in the word segmentation result according to the candidate set.
Further, replacing the related words in the word segmentation result according to the candidate set includes:
judging whether the user-defined vocabulary contains word frequencies;
if yes, determining the word with the highest word frequency in the candidate set;
determining related words of words with the highest word frequency in the candidate set in word segmentation results;
and in the word segmentation result, replacing related words with the highest word frequency to obtain a corrected word segmentation result.
Further, if the user-defined vocabulary does not contain the word frequency, the method further comprises:
determining the phrase with the longest word length in the candidate set;
determining a word related to the phrase with the longest word length in the word segmentation result;
and in the word segmentation result, replacing the related word by the word group with the longest word length to obtain a corrected word segmentation result.
In a second aspect, the present application further provides a multi-source information fused word segmentation apparatus, including:
a feature vector generating module, used for generating the unary information feature vector, the binary information feature vector and the dependency syntax information feature vector of the sentence to be recognized;
a fusion output module, used for fusing the unary information feature vector, the binary information feature vector and the dependency syntax information feature vector in the fusion layer of the multi-source fusion model, the output layer of the multi-source fusion model outputting the word segmentation result of the sentence to be recognized.
In a third aspect, the present application further provides an electronic device, including: at least one processor and at least one memory; the memory is to store one or more program instructions; the processor is configured to execute one or more program instructions to perform the method of any one of the above.
According to a fourth aspect of the present application, there is provided a computer readable storage medium having one or more program instructions embodied therein for performing the steps of any of the above.
In the embodiments of the application, the multi-source information fusion method effectively learns context and external resource information, improving the recognition of ambiguous words and the recall rate of out-of-vocabulary words. Meanwhile, the assistance of a user-defined vocabulary can meet users' customization needs and improve the recall rate of the word segmentation result.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, serve to provide a further understanding of the application and to enable other features, objects, and advantages of the application to be more apparent. The drawings and their description illustrate the embodiments of the invention and do not limit it. In the drawings:
FIG. 1 is a flow diagram of a multi-granular fused word segmentation method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a multi-source information fused segmentation model according to an embodiment of the present application;
fig. 3 is a flowchart of correcting the word segmentation result by using a preset custom vocabulary according to an embodiment of the present application;
fig. 4 is a flowchart of replacing related words in the word segmentation result according to the candidate set according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
The application provides a multi-granularity fusion word segmentation method. Referring to fig. 1, a flowchart of the multi-source information fusion word segmentation method, the method comprises the following steps:
step S102, generating the unary information feature vector, the binary information feature vector and the dependency syntax information feature vector of the sentence to be recognized;
step S104, fusing the unary information feature vector, the binary information feature vector and the dependency syntax information feature vector in the fusion layer of the multi-source fusion model;
and step S106, outputting the word segmentation result of the sentence to be recognized at the output layer of the multi-source fusion model.
Referring to fig. 2, a schematic structural diagram of the multi-source information fusion segmentation model, the model includes a BERT model. The sentence to be recognized is input into the BERT model, which outputs the unary information feature vector of the sentence.
The multi-source information fusion word segmentation model includes a fusion layer, which fuses the unary information feature vector, the binary information feature vector and the dependency syntax information feature vector of the sentence to be recognized; the final word segmentation result is output through the attention layer.
The method fuses unary information, binary information and dependency syntax information, effectively learns context information and external resource information, produces the word segmentation result, and improves the accuracy and recall rate of word segmentation.
In order to train the BERT model and the multi-source information fusion model, a data sample set needs to be established first, and the sample data needs to be labeled manually.
Regarding the design of word segmentation rules and the labeling framework: first, coarse-grained and fine-grained word segmentation rules are designed according to the standard for modern Chinese word segmentation for information processing (GB/T 13715-92) together with legal professional knowledge, and legal knowledge engineers label the word segmentation data set according to these rules. The fine-grained rules follow GB/T 13715-92 and adopt the BMES four-tag sequence labeling scheme, where B marks the first character of a word, M a middle character, E the last character, and S a single-character word. Second, words or phrases with legal meaning are counted as the basis of the coarse-grained segmentation standard. Third, the coarse-grained word segmentation rules are designed according to the statistically obtained words or phrases with legal meaning. Finally, legal knowledge engineers label and correct the data set according to the coarse-grained and fine-grained word segmentation rules.
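The BMES labeling scheme described above can be sketched in a few lines; the function name is illustrative and not part of the patent:

```python
def bmes_tags(words):
    """Convert a list of segmented words into BMES character tags.

    B = first character of a multi-character word, M = middle character,
    E = last character, S = a single-character word.
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.append("B")
            tags.extend("M" * (len(word) - 2))
            tags.append("E")
    return tags
```

For instance, the coarse-grained segmentation ["中华人民共和国", "刑法"] yields the tag sequence B M M M M M E B E.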
In one embodiment, in step S102, generating the unary information feature vector of the sentence to be recognized includes:
generating a unary character sequence of the sentence to be recognized;
setting a word segmentation granularity tag for the input sentence;
and, according to the word segmentation granularity tag, encoding the sentence to be recognized with the BERT model to obtain the unary information vector of the sentence to be recognized.
Illustratively, the character sequence and the granularity tag are encoded with a BERT (Bidirectional Encoder Representations from Transformers) model, generating a high-dimensional dense unary information vector representation.
For a sentence X = {x1, x2, …, xt}, a word segmentation granularity tag is added to the input: "[tc]" denotes coarse-grained and "[tf]" denotes fine-grained, composing the input sequence:

I = {[CLS], [tc], x1, x2, …, xt, [SEP]}

A unary vector representation (unigram embedding) H is obtained from the pre-trained BERT model:

H = BERT(I)
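The construction of the input sequence can be sketched as follows; the tag strings "[tc]" and "[tf]" follow the description above, while the function name and the string-valued granularity argument are illustrative assumptions:

```python
def build_input_sequence(sentence, granularity="coarse"):
    """Compose I = {[CLS], [tc] or [tf], x1, ..., xt, [SEP]}.

    "[tc]" marks coarse-grained segmentation and "[tf]" fine-grained;
    the resulting token sequence would then be fed to the BERT encoder.
    """
    tag = "[tc]" if granularity == "coarse" else "[tf]"
    return ["[CLS]", tag] + list(sentence) + ["[SEP]"]
```

For example, build_input_sequence("刑法") yields ["[CLS]", "[tc]", "刑", "法", "[SEP]"].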
In one embodiment, in step S102, generating a binary information feature vector of the sentence to be recognized includes:
generating a binary character sequence of the sentence to be recognized;
and querying a static word vector table to obtain a binary information vector of the binary character sequence.
Illustratively, combining each token of the input sequence with the next token yields the binary character sequence B = {[CLS][tc], [tc]x1, x1x2, x2x3, …, x(t-1)xt, xt[SEP]}, and a pre-trained static word vector table is queried to obtain the binary vector representation (bigram embedding) E. The static word vectors are pre-trained 100-dimensional bigram embeddings obtained from fastNLP.

E = BigramEmbed(B)
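A minimal sketch of the bigram construction and table lookup; the zero-vector fallback for unseen bigrams is an assumption, and a real table would hold the pre-trained fastNLP bigram embeddings:

```python
def bigram_sequence(tokens):
    """Pair each token with its successor: {t0t1, t1t2, ..., t(n-1)tn}."""
    return [tokens[i] + tokens[i + 1] for i in range(len(tokens) - 1)]


def bigram_embed(bigrams, table, dim=100, unk=None):
    """Look up each bigram in a static embedding table.

    Bigrams missing from the table fall back to a shared UNK vector
    (a zero vector here, purely as an illustrative choice).
    """
    if unk is None:
        unk = [0.0] * dim
    return [table.get(b, unk) for b in bigrams]
```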
After the unary vector H and the binary vector E are obtained, they are spliced, and a gating mechanism is used to obtain a context-dependent joint representation, fusing the unigram embedding and the bigram embedding into a fused representation F:

h'_t = tanh(W_h h_t + b_h)
e'_t = tanh(W_e e_t + b_e)
g_t = σ(W_fh h_t + W_fe e_t + b_f)
f_t = g_t ⊙ h'_t + (1 - g_t) ⊙ e'_t
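An element-wise, scalar-weight sketch of this gate; plain Python floats stand in for the learned matrices W_h, W_e, W_fh, W_fe, so it is purely illustrative of the arithmetic:

```python
import math


def gate_fuse(h, e, w_h=1.0, b_h=0.0, w_e=1.0, b_e=0.0,
              w_fh=1.0, w_fe=1.0, b_f=0.0):
    """Fuse two feature vectors h and e with a sigmoid gate.

    Per dimension: h' = tanh(w_h*h + b_h), e' = tanh(w_e*e + b_e),
    g = sigmoid(w_fh*h + w_fe*e + b_f), fused = g*h' + (1-g)*e'.
    """
    fused = []
    for ht, et in zip(h, e):
        h_p = math.tanh(w_h * ht + b_h)
        e_p = math.tanh(w_e * et + b_e)
        g = 1.0 / (1.0 + math.exp(-(w_fh * ht + w_fe * et + b_f)))
        fused.append(g * h_p + (1 - g) * e_p)
    return fused
```

The gate interpolates between the transformed unigram and bigram signals, so each fused value stays in the tanh range [-1, 1].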
In one embodiment, in step S102, generating the dependency syntax information feature vector of the sentence to be recognized includes:
for each character, acquiring the context feature set and the syntactic feature set of the character;
encoding the context feature set and the syntactic feature set to obtain a context feature vector and a syntactic feature vector;
obtaining the dependency syntax information feature vector of the character from the syntactic feature output vector and the context feature output vector;
and summing the dependency syntax information feature vectors of all characters to obtain the dependency syntax information feature vector of the sentence to be recognized.
Illustratively, for the current character, the Stanford parser is used to obtain the word containing the character together with its head word and dependency labels in the dependency syntax tree; these constitute the contextual features and syntactic features. For example, in the sentence "Is drunk driving illegal, or is it a crime?", the character "酒" ("wine") lies in the word "drunk", whose head word in the dependency syntax tree is "driving"; the dependency label of "drunk" is "nsubj" and that of "driving" is "root". For this character, the contextual feature c1 = [drunk, driving] and the syntactic feature d1 = [drunk_nsubj, driving_root] can be obtained.
For each character x_i, a set of contextual features c_i = [c_{i,1}, c_{i,2}, …, c_{i,m}] and a set of syntactic features d_i = [d_{i,1}, d_{i,2}, …, d_{i,m}] are defined. The model encodes the contextual and syntactic features, and the different pieces of contextual and syntactic knowledge are compared and weighted within their respective attention channels, so that their contributions in a specific context are identified. The attention weight of a contextual feature is thus defined as:

a_{i,j} = exp(h_i · e^c_{i,j}) / Σ_k exp(h_i · e^c_{i,k})

where h_i is the unary vector representation of the character x_i and e^c_{i,j} is the vector representation of c_{i,j}. The contextual feature output vector o^c_i is the weighted sum of the e^c_{i,j} with their corresponding attention weights:

o^c_i = Σ_j a_{i,j} e^c_{i,j}
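The softmax weighting and weighted sum over a character's feature embeddings can be sketched as follows; plain Python with hand-supplied vectors, no learned parameters, so it is illustrative only:

```python
import math


def attention_pool(h_i, feature_vectors):
    """Softmax-weighted sum of feature embeddings scored against h_i.

    Scores are dot products h_i . e_{i,j}; the output is the weighted
    sum of the feature vectors under the resulting softmax weights.
    """
    scores = [sum(a * b for a, b in zip(h_i, e)) for e in feature_vectors]
    m = max(scores)                      # stabilise the softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [x / z for x in exps]
    dim = len(h_i)
    return [sum(w * e[d] for w, e in zip(weights, feature_vectors))
            for d in range(dim)]
```

When all feature vectors are identical the weights are uniform and the pooled output equals that vector, matching the intuition that attention only discriminates when the features differ.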
The syntactic feature output vector o^d_i is calculated by the same method; finally, it is concatenated with the contextual feature output vector to obtain the dependency syntax information vector a_i:

a_i = [o^c_i ; o^d_i]
The dependency syntax information vector is then concatenated with the unary and binary vectors fused by the gating mechanism. Because the fused information is still at the character level and lacks context knowledge, the fused representation must be contextualized: multi-head attention is used to acquire the context representation and obtain the final representation R:

R = MultiHeadAttention([F ; A])

where A stacks the dependency syntax information vectors a_i.
Finally, the first-dimension vector of the unary information representation is taken out and a classifier predicts the word segmentation granularity tag. Meanwhile, according to the information interaction result, a conditional random field (CRF) is used for decoding and the word segmentation labels are predicted, giving the final word segmentation result:

Y = CRF(R)
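The CRF decoding step can be illustrated with a minimal Viterbi decoder over BMES tags; the scores below are hand-set for illustration, whereas a trained CRF learns its emission and transition scores jointly with the encoder:

```python
def viterbi_decode(emissions, transitions):
    """Find the highest-scoring tag sequence.

    emissions:   list of {tag: score} dicts, one per character
    transitions: {(prev_tag, tag): score}, missing pairs score 0
    """
    tags = list(emissions[0])
    score = {t: emissions[0][t] for t in tags}   # best score per tag
    back = []                                    # backpointers per step
    for em in emissions[1:]:
        new_score, ptr = {}, {}
        for t in tags:
            prev, s = max(
                ((p, score[p] + transitions.get((p, t), 0.0)) for p in tags),
                key=lambda x: x[1])
            new_score[t] = s + em[t]
            ptr[t] = prev
        score, back = new_score, back + [ptr]
    best = max(score, key=score.get)
    path = [best]
    for ptr in reversed(back):                   # follow backpointers
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With emissions favouring B then E and a transition bonus for B followed by E, the decoder recovers the tag sequence of a two-character word.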
In addition, because the legal field contains a large number of professional terms, a data set cannot cover them completely, and out-of-vocabulary words reduce the recall rate of the word segmentation result. To correct the obtained word segmentation result and improve its quality, in one implementation, after the text to be recognized is fed into the trained legal word segmentation model to obtain a word segmentation result of mixed coarse and fine granularity, a preset user-defined vocabulary is used to correct it. Referring to fig. 3, the correction comprises the following steps:
step S301, judging whether the user-defined vocabulary contains the word segmentation result; if so, go to step S302;
step S302, judging whether the user-defined word list has the subset of the word segmentation result or not, or whether a long word group containing the word segmentation result exists or not;
if yes, executing step S303; if not, executing step S305;
step S305, outputting word segmentation results;
step S303, reading the subset of the word segmentation results or the long word group containing the word segmentation results in the user-defined word list as a candidate set;
step S304, relevant words in the word segmentation result are replaced according to the candidate set;
referring to fig. 4, the following steps are specifically adopted in the step:
step S3041, judging whether the user-defined vocabulary contains word frequencies;
if so, go to step S3042; if not, go to step S3045;
step S3042, determining the word with the highest word frequency in the candidate set;
step S3043, determining a related word of the word with the highest word frequency in the candidate set in the word segmentation result;
step S3044, in the word segmentation result, replacing the related word with the highest word frequency to obtain a modified word segmentation result.
Illustratively, suppose the word in the candidate set is "Criminal Law of the People's Republic of China" and the related words in the word segmentation result are "the People's Republic of China" and "Criminal Law".
"Criminal Law of the People's Republic of China" has the highest word frequency in the candidate set, so "the People's Republic of China" and "Criminal Law" in the word segmentation result are replaced with "Criminal Law of the People's Republic of China".
Step S3048 is then executed.
step S3045, determining the phrase with the longest word length in the candidate set;
step S3046, determining the words related to the phrase with the longest word length in the word segmentation result;
step S3047, in the word segmentation result, replacing the related word with the word group with the longest word length to obtain a corrected word segmentation result.
Illustratively, suppose the longest phrase in the candidate set is "Criminal Law of the People's Republic of China" and the related words in the word segmentation result are "the People's Republic of China" and "Criminal Law".
Then "the People's Republic of China" and "Criminal Law" in the word segmentation result are replaced with "Criminal Law of the People's Republic of China".
Step S3048, outputting the corrected word segmentation result.
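Steps S301 to S3048 can be sketched as a single correction pass; the function and parameter names are illustrative, and matching a candidate phrase as a concatenation of adjacent output words follows the replacement described above:

```python
def correct_segmentation(words, candidates, freqs=None):
    """Replace adjacent words whose concatenation equals a candidate phrase.

    The candidate is chosen by highest word frequency when frequencies are
    available (steps S3042-S3044), otherwise by longest word length
    (steps S3045-S3047).
    """
    if not candidates:
        return list(words)
    if freqs:
        best = max(candidates, key=lambda c: freqs.get(c, 0))
    else:
        best = max(candidates, key=len)
    out, i = [], 0
    while i < len(words):
        j, joined = i, ""
        # greedily join following words until the candidate length is reached
        while j < len(words) and len(joined) < len(best):
            joined += words[j]
            j += 1
        if joined == best:
            out.append(best)       # merge words[i:j] into the phrase
            i = j
        else:
            out.append(words[i])
            i += 1
    return out
```

For example, with the candidate "中华人民共和国刑法", the output ["中华人民共和国", "刑法", "第一条"] is corrected to ["中华人民共和国刑法", "第一条"].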
From the above description, it can be seen that the invention achieves the following technical effects. A multi-granularity word segmentation method with multi-source information fusion is adopted, meeting the word segmentation requirements of legal scenarios. Word segmentation is needed as an aid to text understanding and cognition; the method can improve text reading efficiency and discover phrases specific to the legal field. By using the multi-source information fusion method, context and external resource information are effectively learned, improving the recognition of ambiguous words and the recall rate of out-of-vocabulary words. Meanwhile, the assistance of the user-defined vocabulary can meet users' customization needs and improve the recall rate of the word segmentation result.
The invention is directed to the legal field and learns sentence sequences with a multi-source information fusion neural network model to give word segmentation results. The multi-source information fusion method effectively extracts context information and external syntactic information, alleviating the ambiguity problem and the recognition problem of out-of-vocabulary words and improving multi-granularity word segmentation precision. A multi-granularity joint learning method shares the representation of the underlying character sequence, realizing information sharing among segmentation corpora of different granularities. The assistance of a user-defined vocabulary can meet users' customization needs and improve the recall rate of the word segmentation result. With multi-granularity word segmentation based on multi-source information fusion for the legal field, a segmentation scheme can be selected independently according to task requirements, facilitating the development of related language understanding tasks in the judicial field.
Word-level legal tasks all depend on the word segmentation result, for example event extraction, entity recognition and semantic abstraction tasks. The multi-granularity word segmentation model oriented to legal text understanding can reduce the propagation of segmentation errors and improve the effect of downstream tasks. Experimental results show that joint learning of word segmentation and entity recognition increases the F1 value of entity recognition by 2%. For example, in the entity recognition task, for the sentence "How is the compensation amount calculated for a grade-nine disability in a traffic accident?", coarse-grained segmentation can identify the two phrases "traffic accident" and "grade-nine disability", which correspond directly to entity labels, so the entities can be located well and their types distinguished.
Compared with the prior art: (1) the method adopts a multi-granularity word segmentation method and can meet the word segmentation requirements of legal scenarios.
(2) Aimed at the understanding and cognition of legal texts, the method can discover phrases specific to the legal field, effectively learn context and external resource information through multi-source information fusion, and improve text reading efficiency through multi-granularity word segmentation.
(3) The assistance of the user-defined vocabulary can meet users' customization needs, improving the recall rate of the word segmentation result as well as the recognition of ambiguous words and the recall rate of out-of-vocabulary words.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
According to the second aspect of the present application, there is also provided a multi-source information fused word segmentation apparatus, as shown in fig. 4, the apparatus includes:
the characteristic vector generating module is used for generating a unitary information characteristic vector, a binary information characteristic vector and a dependency syntax information characteristic vector of the sentence to be identified;
the fusion output module is used for fusing the unary information characteristic vector, the binary information characteristic vector and the dependency syntax information characteristic vector in a fusion layer of the multi-source fusion model; and the output layer of the multi-source fusion model outputs the word segmentation result of the sentence to be recognized.
According to a third aspect of the present application, there is also provided an electronic device comprising at least one processor and at least one memory; the memory is to store one or more program instructions; the processor is configured to execute one or more program instructions to perform any of the methods described above.
In a fourth aspect, the present application also proposes a computer-readable storage medium having embodied therein one or more program instructions for executing the method of any one of the above.
The various methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The processor reads the information in the storage medium and completes the steps of the method in combination with the hardware.
The storage medium may be, for example, a memory; the memory may be volatile memory, nonvolatile memory, or a combination of both.
The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or flash memory.
The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM).
The storage media described in connection with the embodiments of the invention are intended to comprise, without being limited to, these and any other suitable types of memory.
Those skilled in the art will appreciate that, in one or more of the examples described above, the functions described in the present invention may be implemented by a combination of hardware and software. When implemented in software, the corresponding functions may be stored on, or transmitted as one or more instructions or code over, a computer-readable medium. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
The above description is only a preferred embodiment of the present application and is not intended to limit it; those skilled in the art may make various modifications and changes. Any modification, equivalent replacement or improvement made within the spirit and principles of the present application shall fall within its protection scope.

Claims (10)

1. A word segmentation method for multi-source information fusion, characterized by comprising the following steps:
generating a unigram information feature vector, a bigram information feature vector and a dependency syntax information feature vector of a sentence to be recognized;
fusing, in a fusion layer of a multi-source fusion model, the unigram information feature vector, the bigram information feature vector and the dependency syntax information feature vector;
and outputting, by an output layer of the multi-source fusion model, the word segmentation result of the sentence to be recognized.
2. The word segmentation method for multi-source information fusion according to claim 1, wherein generating the unigram information feature vector of the sentence to be recognized comprises:
generating a unigram character sequence of the sentence to be recognized;
setting a word segmentation granularity label for the sentence to be recognized;
and encoding the sentence to be recognized with a BERT model according to the word segmentation granularity label, to obtain the unigram information feature vector of the sentence to be recognized.
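The first two steps of this claim can be sketched as follows: the unigram (single-character) sequence, plus the granularity label prepended as a special token in the style of BERT's [CLS]. The label vocabulary ([FINE]/[COARSE]) and the prepend convention are assumptions, not from the patent:

```python
def unigram_sequence(sentence):
    # A Chinese sentence's unigram sequence is simply its character list.
    return list(sentence)

def with_granularity_label(chars, granularity):
    # Hypothetical: encode the segmentation-granularity label as a
    # special leading token, analogous to BERT's [CLS] token.
    return ["[" + granularity.upper() + "]"] + chars

seq = with_granularity_label(unigram_sequence("多源信息融合"), "coarse")
print(seq)  # ['[COARSE]', '多', '源', '信', '息', '融', '合']
```

The resulting token sequence would then be fed to the BERT encoder to yield the unigram information feature vector.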
3. The word segmentation method for multi-source information fusion according to claim 1, wherein generating the bigram information feature vector of the sentence to be recognized comprises:
generating a bigram character sequence of the sentence to be recognized;
and querying a static word vector table to obtain the bigram information feature vector of the bigram character sequence.
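A minimal sketch of this claim's two steps is shown below. The end-of-sentence padding and the UNK fallback for out-of-vocabulary bigrams are illustrative conventions, and the toy vector table stands in for a real pretrained, frozen embedding table:

```python
def bigram_sequence(sentence):
    # Adjacent character bigrams; the final position is padded with an
    # end marker so the bigram sequence aligns with the unigram sequence.
    return [sentence[i:i + 2] for i in range(len(sentence) - 1)] + [sentence[-1] + "</s>"]

# Hypothetical static bigram-vector table (e.g. pretrained and frozen);
# out-of-vocabulary bigrams fall back to a shared UNK vector.
bigram_table = {"多源": [0.1, 0.2], "源信": [0.3, 0.1]}
unk_vector = [0.0, 0.0]

bigrams = bigram_sequence("多源信息")
vectors = [bigram_table.get(bg, unk_vector) for bg in bigrams]
print(bigrams)  # ['多源', '源信', '信息', '息</s>']
```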
4. The word segmentation method for multi-source information fusion according to claim 1, wherein generating the dependency syntax information feature vector of the sentence to be recognized comprises:
for each character, acquiring a context feature set and a syntactic feature set of the character;
encoding the context feature set and the syntactic feature set to obtain a context feature vector and a syntactic feature vector;
obtaining the dependency syntax information feature vector of the character from the syntactic feature vector and the context feature vector;
and summing the dependency syntax information feature vectors of the characters to obtain the dependency syntax information feature vector of the sentence to be recognized.
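The per-character combination and final summation of this claim can be sketched as follows; elementwise addition for combining the two encoded vectors is an illustrative choice, since the claim does not fix the operator, and the toy integer vectors are stand-ins for real encoder outputs:

```python
def char_dep_vector(context_vec, syntax_vec):
    # Combine one character's encoded context-feature vector and
    # syntactic-feature vector (elementwise addition as a stand-in).
    return [c + s for c, s in zip(context_vec, syntax_vec)]

def sentence_dep_vector(per_char_vectors):
    # Sum the per-character vectors to obtain the sentence-level
    # dependency syntax information feature vector.
    return [sum(dim) for dim in zip(*per_char_vectors)]

# Two characters with toy (integer) feature vectors.
encoded = [([1, 2], [0, 1]),   # (context_vec, syntax_vec) of char 1
           ([2, 1], [1, 0])]   # (context_vec, syntax_vec) of char 2
per_char = [char_dep_vector(c, s) for c, s in encoded]
sent_vec = sentence_dep_vector(per_char)
print(sent_vec)  # [4, 4]
```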
5. The word segmentation method for multi-source information fusion according to claim 1, wherein after the output layer of the multi-source fusion model outputs the word segmentation result of the sentence to be recognized, the method further comprises:
correcting the word segmentation result by using a preset custom word list.
6. The word segmentation method for multi-source information fusion according to claim 5, wherein correcting the word segmentation result by using the preset custom word list comprises:
judging whether the custom word list contains the word segmentation result; if yes, judging whether a subset of the word segmentation result, or a longer phrase containing the word segmentation result, exists in the custom word list;
if yes, reading the subset of the word segmentation result, or the longer phrase containing the word segmentation result, from the custom word list as a candidate set;
and replacing the related words in the word segmentation result according to the candidate set.
7. The word segmentation method for multi-source information fusion according to claim 6, wherein replacing the related words in the word segmentation result according to the candidate set comprises:
judging whether the custom word list contains word frequencies;
if yes, determining the word with the highest word frequency in the candidate set;
determining, in the word segmentation result, the word related to the word with the highest word frequency in the candidate set;
and replacing, in the word segmentation result, the related word with the word with the highest word frequency to obtain the corrected word segmentation result.
8. The word segmentation method for multi-source information fusion according to claim 6, wherein, if the custom word list does not contain word frequencies, the method further comprises:
determining the phrase with the longest word length in the candidate set;
determining, in the word segmentation result, the word related to the phrase with the longest word length;
and replacing, in the word segmentation result, the related word with the phrase with the longest word length to obtain the corrected word segmentation result.
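Claims 5–8 together describe a dictionary-based post-correction. A much-simplified sketch of the selection rule (highest frequency when frequencies exist, otherwise longest phrase) is given below; the candidate generation and the handling of "related words" are illustrative reductions of the claimed procedure, and merging of adjacent tokens is omitted:

```python
def correct_segmentation(tokens, custom_vocab, word_freq=None):
    # For each token found in the custom word list, gather candidate
    # entries that are substrings of it or longer phrases containing it,
    # then replace the token with the highest-frequency candidate, or
    # with the longest candidate when no frequencies are provided.
    corrected = list(tokens)
    for i, tok in enumerate(tokens):
        if tok not in custom_vocab:
            continue
        candidates = [w for w in custom_vocab
                      if w != tok and (w in tok or tok in w)]
        if not candidates:
            continue
        if word_freq:
            best = max(candidates, key=lambda w: word_freq.get(w, 0))
        else:
            best = max(candidates, key=len)
        corrected[i] = best
    return corrected

print(correct_segmentation(["机器", "翻译"], {"机器", "机器学习"}))
# ['机器学习', '翻译']
```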
9. A word segmentation apparatus for multi-source information fusion, characterized by comprising:
a feature vector generation module, configured to generate a unigram information feature vector, a bigram information feature vector and a dependency syntax information feature vector of a sentence to be recognized;
and a fusion output module, configured to fuse the unigram information feature vector, the bigram information feature vector and the dependency syntax information feature vector in a fusion layer of a multi-source fusion model, and to output, through an output layer of the multi-source fusion model, the word segmentation result of the sentence to be recognized.
10. An electronic device, characterized by comprising: at least one processor and at least one memory; the memory is configured to store one or more program instructions; and the processor is configured to execute the one or more program instructions to perform the method of any one of claims 1-8.
CN202110776250.4A 2021-07-08 2021-07-08 Word segmentation method, device and equipment for multi-source information fusion Pending CN113505828A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110776250.4A CN113505828A (en) 2021-07-08 2021-07-08 Word segmentation method, device and equipment for multi-source information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110776250.4A CN113505828A (en) 2021-07-08 2021-07-08 Word segmentation method, device and equipment for multi-source information fusion

Publications (1)

Publication Number Publication Date
CN113505828A true CN113505828A (en) 2021-10-15

Family

ID=78012370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110776250.4A Pending CN113505828A (en) 2021-07-08 2021-07-08 Word segmentation method, device and equipment for multi-source information fusion

Country Status (1)

Country Link
CN (1) CN113505828A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114386407A (en) * 2021-12-23 2022-04-22 北京金堤科技有限公司 Word segmentation method and device for text

Similar Documents

Publication Publication Date Title
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN108920473A (en) A kind of data enhancing machine translation method based on similar word and synonym replacement
CN110674646A (en) Mongolian Chinese machine translation system based on byte pair encoding technology
CN111160031A (en) Social media named entity identification method based on affix perception
CN109522403A (en) A kind of summary texts generation method based on fusion coding
CN111767718A (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN113190656A (en) Chinese named entity extraction method based on multi-label framework and fusion features
CN116502628A (en) Multi-stage fusion text error correction method for government affair field based on knowledge graph
CN111814477B (en) Dispute focus discovery method and device based on dispute focus entity and terminal
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN114333838A (en) Method and system for correcting voice recognition text
CN113505828A (en) Word segmentation method, device and equipment for multi-source information fusion
CN114154504A (en) Chinese named entity recognition algorithm based on multi-information enhancement
CN112199952B (en) Word segmentation method, multi-mode word segmentation model and system
CN113505592A (en) Multi-granularity fused word segmentation method, device, equipment and storage medium
CN111368531B (en) Translation text processing method and device, computer equipment and storage medium
CN111563534B (en) Task-oriented word embedding vector fusion method based on self-encoder
CN112818698A (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN113360601A (en) PGN-GAN text abstract model fusing topics
CN113761883A (en) Text information identification method and device, electronic equipment and storage medium
CN116991875A (en) SQL sentence generation and alias mapping method and device based on big model
CN116842944A (en) Entity relation extraction method and device based on word enhancement
CN116187304A (en) Automatic text error correction algorithm and system based on improved BERT
CN116681061A (en) English grammar correction technology based on multitask learning and attention mechanism
CN114970537B (en) Cross-border ethnic cultural entity relation extraction method and device based on multi-layer labeling strategy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination