CN113505828A - Word segmentation method, device and equipment for multi-source information fusion - Google Patents

Word segmentation method, device and equipment for multi-source information fusion Download PDF

Info

Publication number
CN113505828A
CN113505828A (application CN202110776250.4A)
Authority
CN
China
Prior art keywords
word segmentation
word
information
vector
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110776250.4A
Other languages
Chinese (zh)
Inventor
顾敏 (Gu Min)
杜向阳 (Du Xiangyang)
徐芳 (Xu Fang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Aijuesi Information Technology Co ltd
Original Assignee
Shanghai Aijuesi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Aijuesi Information Technology Co ltd filed Critical Shanghai Aijuesi Information Technology Co ltd
Priority to CN202110776250.4A priority Critical patent/CN113505828A/en
Publication of CN113505828A publication Critical patent/CN113505828A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The application discloses a word segmentation method, device and equipment for multi-source information fusion. The word segmentation method comprises the following steps: generating a unary information feature vector, a binary information feature vector and a dependency syntax information feature vector of the sentence to be recognized; fusing the unary information feature vector, the binary information feature vector and the dependency syntax information feature vector in the fusion layer of the multi-source fusion model; and outputting the word segmentation result of the sentence to be recognized at the output layer of the multi-source fusion model. By fusing unary information, binary information and dependency syntax information, the method effectively learns context information and external resource information, produces word segmentation results, and improves the accuracy and recall rate of word segmentation.

Description

Word segmentation method, device and equipment for multi-source information fusion
Technical Field
The application relates to the technical field of word segmentation processing, in particular to a word segmentation method, device and equipment for multi-source information fusion.
Background
In the prior art, word segmentation methods are mainly oriented to general data sets and exploit a single type of feature: generally only unary (single-character) or binary (character-bigram) information is used, which cannot meet the requirements of specific scenarios, such as legal scenarios. The legal field contains a large number of legal entities that a general word segmentation model cannot segment accurately. For example, in "Criminal Law of the People's Republic of China", "the People's Republic of China" and "Criminal Law" stand in a dependency relationship, an attributive-head relation, indicating that the two words are a modifier and a head respectively and, in the legal field, should be joined into a single phrase. An existing word segmentation method, however, would mistakenly split it into the two words "the People's Republic of China" and "Criminal Law". Because existing methods do not consider dependency syntax information, their segmentation is often not accurate enough, and the resulting segmentation hinders legal reading and gives a poor experience.
Disclosure of Invention
The present application mainly aims to provide a word segmentation method, device and apparatus for multi-source information fusion, so as to solve the above problems.
In order to achieve the above object, according to one aspect of the present application, a word segmentation method for multi-source information fusion is provided, including:
generating a unary information feature vector, a binary information feature vector and a dependency syntax information feature vector of the sentence to be recognized;
fusing the unary information feature vector, the binary information feature vector and the dependency syntax information feature vector in the fusion layer of the multi-source fusion model;
and outputting the word segmentation result of the sentence to be recognized at the output layer of the multi-source fusion model.
Further, generating the unary information feature vector of the sentence to be recognized includes:
generating a unary character sequence of the sentence to be recognized;
setting a word segmentation granularity tag for the input sentence;
and, according to the word segmentation granularity tag, encoding the sentence to be recognized with a BERT model to obtain the unary information vector of the sentence to be recognized.
Further, generating a binary information feature vector of the sentence to be recognized includes:
generating a binary character sequence of the sentence to be recognized;
and querying a static word vector table to obtain a binary information vector of the binary character sequence.
Further, generating the dependency syntax information feature vector of the sentence to be recognized includes:
for each character, acquiring a context feature set and a syntactic feature set of the character;
encoding the context feature set and the syntactic feature set to obtain a context feature vector and a syntactic feature vector;
obtaining the dependency syntax information feature vector of the character from the syntactic feature output vector and the context feature output vector;
and summing the dependency syntax information feature vectors of all characters to obtain the dependency syntax information feature vector of the sentence to be recognized.
Further, after the output layer of the multi-source fusion model outputs the word segmentation result of the sentence to be recognized, the method further comprises the following steps:
and correcting the word segmentation result by adopting a preset user-defined word list.
Further, correcting the word segmentation result with a preset user-defined vocabulary comprises the following steps:
judging whether the user-defined vocabulary contains the word segmentation result; if so, judging whether the user-defined vocabulary contains a subset of the word segmentation result, or a longer phrase containing the word segmentation result;
if so, reading the subset of the word segmentation result, or the longer phrase containing it, from the user-defined vocabulary as a candidate set;
and replacing the related words in the word segmentation result according to the candidate set.
Further, replacing the related words in the word segmentation result according to the candidate set includes:
judging whether the user-defined vocabulary contains word frequencies;
if yes, determining the word with the highest word frequency in the candidate set;
determining related words of words with the highest word frequency in the candidate set in word segmentation results;
and in the word segmentation result, replacing related words with the highest word frequency to obtain a corrected word segmentation result.
Further, if the user-defined vocabulary does not contain the word frequency, the method further comprises:
determining the phrase with the longest word length in the candidate set;
determining a word related to the phrase with the longest word length in the word segmentation result;
and in the word segmentation result, replacing the related word by the word group with the longest word length to obtain a corrected word segmentation result.
In a second aspect, the present application further provides a multi-source information fused word segmentation apparatus, including:
a feature vector generating module, used for generating the unary information feature vector, the binary information feature vector and the dependency syntax information feature vector of the sentence to be recognized;
a fusion output module, used for fusing the unary information feature vector, the binary information feature vector and the dependency syntax information feature vector in the fusion layer of the multi-source fusion model, the output layer of the multi-source fusion model outputting the word segmentation result of the sentence to be recognized.
In a third aspect, the present application further provides an electronic device, including: at least one processor and at least one memory; the memory is to store one or more program instructions; the processor is configured to execute one or more program instructions to perform the method of any one of the above.
According to a fourth aspect of the present application, there is provided a computer readable storage medium having one or more program instructions embodied therein for performing the steps of any of the above.
In the embodiments of the application, the multi-source information fusion method effectively learns context and external resource information, improving the recognition of ambiguous words and the recall rate of out-of-vocabulary words. Meanwhile, the assistance of a user-defined vocabulary can meet users' customization needs and improve the recall rate of the word segmentation result.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, serve to provide a further understanding of the application and to enable other features, objects, and advantages of the application to be more apparent. The drawings and their description illustrate the embodiments of the invention and do not limit it. In the drawings:
FIG. 1 is a flow diagram of a multi-granular fused word segmentation method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a multi-source information fused segmentation model according to an embodiment of the present application;
fig. 3 is a flowchart of correcting the word segmentation result by using a preset custom vocabulary according to an embodiment of the present application;
fig. 4 is a flowchart of replacing related words in the word segmentation result according to the candidate set according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
The application provides a multi-granularity fusion word segmentation method. Referring to fig. 1, a flowchart of the multi-source information fusion word segmentation method, the method comprises the following steps:
step S102, generating the unary information feature vector, the binary information feature vector and the dependency syntax information feature vector of the sentence to be recognized;
step S104, fusing the unary information feature vector, the binary information feature vector and the dependency syntax information feature vector in the fusion layer of the multi-source fusion model;
and step S106, outputting the word segmentation result of the sentence to be recognized at the output layer of the multi-source fusion model.
Referring to fig. 2, a schematic structural diagram of the multi-source information fusion segmentation model, the model includes a BERT model. The sentence to be recognized is input into the BERT model, which outputs the unary information feature vector of the sentence.
The multi-source information fusion word segmentation model includes a fusion layer, which fuses the unary information feature vector, the binary information feature vector and the dependency syntax information feature vector of the sentence to be recognized; the final word segmentation result is output through the attention layer.
The method fuses unary information, binary information and dependency syntax information, effectively learns context information and external resource information, produces the word segmentation result, and improves the accuracy and recall rate of word segmentation.
In order to train the BERT model and the multi-source information fusion model, a data sample set needs to be established first, and the sample data needs to be labeled manually.
Regarding the design of word segmentation rules and the labeling framework: first, coarse-grained and fine-grained word segmentation rules are designed according to the standard for modern Chinese word segmentation for information processing (GB/T 13715-92) together with legal professional knowledge, and legal knowledge engineers label the word segmentation data set according to these rules. The fine-grained rules follow GB/T 13715-92 and adopt the BMES four-tag sequence labeling scheme, where B marks the first character of a word, M a middle character, E the last character, and S a single-character word. Second, words or phrases with legal meaning are counted as the basis of the coarse-grained segmentation standard. Third, the coarse-grained word segmentation rules are designed according to the statistically obtained words or phrases with legal meaning. Finally, legal knowledge engineers label and correct the data set according to the coarse-grained and fine-grained word segmentation rules.
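The BMES labeling scheme described above can be sketched in a few lines; the function name is illustrative and not part of the patent:

```python
def bmes_tags(words):
    """Convert a list of segmented words into BMES character tags.

    B = first character of a multi-character word, M = middle character,
    E = last character, S = a single-character word.
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.append("B")
            tags.extend("M" * (len(word) - 2))
            tags.append("E")
    return tags
```

For instance, the coarse-grained segmentation ["中华人民共和国", "刑法"] yields the tag sequence B M M M M M E B E.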
In one embodiment, in step S102, generating the unary information feature vector of the sentence to be recognized includes:
generating a unary character sequence of the sentence to be recognized;
setting a word segmentation granularity tag for the input sentence;
and, according to the word segmentation granularity tag, encoding the sentence to be recognized with the BERT model to obtain the unary information vector of the sentence to be recognized.
Illustratively, the character sequence and the granularity tag are encoded with a BERT (Bidirectional Encoder Representations from Transformers) model, generating a high-dimensional dense unary information vector representation.
For a sentence X = {x1, x2, …, xt}, a word segmentation granularity tag is added to the input: "[tc]" denotes coarse-grained and "[tf]" denotes fine-grained, composing the input sequence:

I = {[CLS], [tc], x1, x2, …, xt, [SEP]}

A unary vector representation (unigram embedding) H is obtained from the pre-trained BERT model:

H = BERT(I)
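The construction of the input sequence can be sketched as follows; the tag strings "[tc]" and "[tf]" follow the description above, while the function name and the string-valued granularity argument are illustrative assumptions:

```python
def build_input_sequence(sentence, granularity="coarse"):
    """Compose I = {[CLS], [tc] or [tf], x1, ..., xt, [SEP]}.

    "[tc]" marks coarse-grained segmentation and "[tf]" fine-grained;
    the resulting token sequence would then be fed to the BERT encoder.
    """
    tag = "[tc]" if granularity == "coarse" else "[tf]"
    return ["[CLS]", tag] + list(sentence) + ["[SEP]"]
```

For example, build_input_sequence("刑法") yields ["[CLS]", "[tc]", "刑", "法", "[SEP]"].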
In one embodiment, in step S102, generating a binary information feature vector of the sentence to be recognized includes:
generating a binary character sequence of the sentence to be recognized;
and querying a static word vector table to obtain a binary information vector of the binary character sequence.
Illustratively, combining each token of the input sequence with the next token yields the binary character sequence B = {[CLS][tc], [tc]x1, x1x2, x2x3, …, x(t-1)xt, xt[SEP]}, and a pre-trained static word vector table is queried to obtain the binary vector representation (bigram embedding) E. The static word vectors are pre-trained 100-dimensional bigram embeddings obtained from fastNLP.

E = BigramEmbed(B)
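A minimal sketch of the bigram construction and table lookup; the zero-vector fallback for unseen bigrams is an assumption, and a real table would hold the pre-trained fastNLP bigram embeddings:

```python
def bigram_sequence(tokens):
    """Pair each token with its successor: {t0t1, t1t2, ..., t(n-1)tn}."""
    return [tokens[i] + tokens[i + 1] for i in range(len(tokens) - 1)]


def bigram_embed(bigrams, table, dim=100, unk=None):
    """Look up each bigram in a static embedding table.

    Bigrams missing from the table fall back to a shared UNK vector
    (a zero vector here, purely as an illustrative choice).
    """
    if unk is None:
        unk = [0.0] * dim
    return [table.get(b, unk) for b in bigrams]
```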
After the unary vector H and the binary vector E are obtained, they are spliced, and a gating mechanism is used to obtain a context-dependent joint representation, fusing the unigram embedding and the bigram embedding into a fused representation F:

h'_t = tanh(W_h h_t + b_h)
e'_t = tanh(W_e e_t + b_e)
g_t = σ(W_fh h_t + W_fe e_t + b_f)
f_t = g_t ⊙ h'_t + (1 - g_t) ⊙ e'_t
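An element-wise, scalar-weight sketch of this gate; plain Python floats stand in for the learned matrices W_h, W_e, W_fh, W_fe, so it is purely illustrative of the arithmetic:

```python
import math


def gate_fuse(h, e, w_h=1.0, b_h=0.0, w_e=1.0, b_e=0.0,
              w_fh=1.0, w_fe=1.0, b_f=0.0):
    """Fuse two feature vectors h and e with a sigmoid gate.

    Per dimension: h' = tanh(w_h*h + b_h), e' = tanh(w_e*e + b_e),
    g = sigmoid(w_fh*h + w_fe*e + b_f), fused = g*h' + (1-g)*e'.
    """
    fused = []
    for ht, et in zip(h, e):
        h_p = math.tanh(w_h * ht + b_h)
        e_p = math.tanh(w_e * et + b_e)
        g = 1.0 / (1.0 + math.exp(-(w_fh * ht + w_fe * et + b_f)))
        fused.append(g * h_p + (1 - g) * e_p)
    return fused
```

The gate interpolates between the transformed unigram and bigram signals, so each fused value stays in the tanh range [-1, 1].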
In one embodiment, in step S102, generating the dependency syntax information feature vector of the sentence to be recognized includes:
for each character, acquiring the context feature set and the syntactic feature set of the character;
encoding the context feature set and the syntactic feature set to obtain a context feature vector and a syntactic feature vector;
obtaining the dependency syntax information feature vector of the character from the syntactic feature output vector and the context feature output vector;
and summing the dependency syntax information feature vectors of all characters to obtain the dependency syntax information feature vector of the sentence to be recognized.
Illustratively, for the current character, the Stanford parser is used to obtain the word containing the character together with its head word and dependency labels in the dependency syntax tree; these constitute the contextual features and syntactic features. For example, in the sentence "Is drunk driving illegal, or is it a crime?", the character "酒" ("wine") lies in the word "drunk", whose head word in the dependency syntax tree is "driving"; the dependency label of "drunk" is "nsubj" and that of "driving" is "root". For this character, the contextual feature c1 = [drunk, driving] and the syntactic feature d1 = [drunk_nsubj, driving_root] can be obtained.
For each character x_i, a set of contextual features c_i = [c_{i,1}, c_{i,2}, …, c_{i,m}] and a set of syntactic features d_i = [d_{i,1}, d_{i,2}, …, d_{i,m}] are defined. The model encodes the contextual and syntactic features, and the different pieces of contextual and syntactic knowledge are compared and weighted within their respective attention channels, so that their contributions in a specific context are identified. The attention weight of a contextual feature is thus defined as:

a_{i,j} = exp(h_i · e^c_{i,j}) / Σ_k exp(h_i · e^c_{i,k})

where h_i is the unary vector representation of the character x_i and e^c_{i,j} is the vector representation of c_{i,j}. The contextual feature output vector o^c_i is the weighted sum of the e^c_{i,j} with their corresponding attention weights:

o^c_i = Σ_j a_{i,j} e^c_{i,j}
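The softmax weighting and weighted sum over a character's feature embeddings can be sketched as follows; plain Python with hand-supplied vectors, no learned parameters, so it is illustrative only:

```python
import math


def attention_pool(h_i, feature_vectors):
    """Softmax-weighted sum of feature embeddings scored against h_i.

    Scores are dot products h_i . e_{i,j}; the output is the weighted
    sum of the feature vectors under the resulting softmax weights.
    """
    scores = [sum(a * b for a, b in zip(h_i, e)) for e in feature_vectors]
    m = max(scores)                      # stabilise the softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [x / z for x in exps]
    dim = len(h_i)
    return [sum(w * e[d] for w, e in zip(weights, feature_vectors))
            for d in range(dim)]
```

When all feature vectors are identical the weights are uniform and the pooled output equals that vector, matching the intuition that attention only discriminates when the features differ.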
The syntactic feature output vector o^d_i is calculated by the same method; finally, it is concatenated with the contextual feature output vector to obtain the dependency syntax information vector a_i:

a_i = [o^c_i ; o^d_i]
The dependency syntax information vector is then concatenated with the unary and binary vectors fused by the gating mechanism. Because the fused information is still at the character level and lacks context knowledge, the fused representation must be contextualized: multi-head attention is used to acquire the context representation and obtain the final representation R:

R = MultiHeadAttention([F ; A])

where A stacks the dependency syntax information vectors a_i.
Finally, the first-dimension vector of the unary information representation is taken out and a classifier predicts the word segmentation granularity tag. Meanwhile, according to the information interaction result, a conditional random field (CRF) is used for decoding and the word segmentation labels are predicted, giving the final word segmentation result:

Y = CRF(R)
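The CRF decoding step can be illustrated with a minimal Viterbi decoder over BMES tags; the scores below are hand-set for illustration, whereas a trained CRF learns its emission and transition scores jointly with the encoder:

```python
def viterbi_decode(emissions, transitions):
    """Find the highest-scoring tag sequence.

    emissions:   list of {tag: score} dicts, one per character
    transitions: {(prev_tag, tag): score}, missing pairs score 0
    """
    tags = list(emissions[0])
    score = {t: emissions[0][t] for t in tags}   # best score per tag
    back = []                                    # backpointers per step
    for em in emissions[1:]:
        new_score, ptr = {}, {}
        for t in tags:
            prev, s = max(
                ((p, score[p] + transitions.get((p, t), 0.0)) for p in tags),
                key=lambda x: x[1])
            new_score[t] = s + em[t]
            ptr[t] = prev
        score, back = new_score, back + [ptr]
    best = max(score, key=score.get)
    path = [best]
    for ptr in reversed(back):                   # follow backpointers
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With emissions favouring B then E and a transition bonus for B followed by E, the decoder recovers the tag sequence of a two-character word.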
In addition, because the legal field contains a large number of professional terms, a data set cannot cover them completely, and out-of-vocabulary words reduce the recall rate of the word segmentation result. To correct the obtained word segmentation result and improve its quality, in one implementation, after the text to be recognized is fed into the trained legal word segmentation model to obtain a word segmentation result of mixed coarse and fine granularity, a preset user-defined vocabulary is used to correct it. Referring to fig. 3, the correction comprises the following steps:
step S301, judging whether the user-defined vocabulary contains the word segmentation result; if so, go to step S302;
step S302, judging whether the user-defined word list has the subset of the word segmentation result or not, or whether a long word group containing the word segmentation result exists or not;
if yes, executing step S303; if not, executing step S305;
step S305, outputting word segmentation results;
step S303, reading the subset of the word segmentation results or the long word group containing the word segmentation results in the user-defined word list as a candidate set;
step S304, relevant words in the word segmentation result are replaced according to the candidate set;
referring to fig. 4, the following steps are specifically adopted in the step:
step S3041, judging whether the user-defined vocabulary contains word frequencies;
if so, go to step S3042; if not, go to step S3045;
step S3042, determining the word with the highest word frequency in the candidate set;
step S3043, determining a related word of the word with the highest word frequency in the candidate set in the word segmentation result;
step S3044, in the word segmentation result, replacing the related word with the highest word frequency to obtain a modified word segmentation result.
Illustratively, suppose the word in the candidate set is "Criminal Law of the People's Republic of China" and the related words in the word segmentation result are "the People's Republic of China" and "Criminal Law".
"Criminal Law of the People's Republic of China" has the highest word frequency in the candidate set, so "the People's Republic of China" and "Criminal Law" in the word segmentation result are replaced with "Criminal Law of the People's Republic of China".
Step S3048 is then executed.
step S3045, determining the phrase with the longest word length in the candidate set;
step S3046, determining the words related to the phrase with the longest word length in the word segmentation result;
step S3047, in the word segmentation result, replacing the related word with the word group with the longest word length to obtain a corrected word segmentation result.
Illustratively, suppose the longest phrase in the candidate set is "Criminal Law of the People's Republic of China" and the related words in the word segmentation result are "the People's Republic of China" and "Criminal Law".
Then "the People's Republic of China" and "Criminal Law" in the word segmentation result are replaced with "Criminal Law of the People's Republic of China".
Step S3048, outputting the corrected word segmentation result.
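Steps S301 to S3048 can be sketched as a single correction pass; the function and parameter names are illustrative, and matching a candidate phrase as a concatenation of adjacent output words follows the replacement described above:

```python
def correct_segmentation(words, candidates, freqs=None):
    """Replace adjacent words whose concatenation equals a candidate phrase.

    The candidate is chosen by highest word frequency when frequencies are
    available (steps S3042-S3044), otherwise by longest word length
    (steps S3045-S3047).
    """
    if not candidates:
        return list(words)
    if freqs:
        best = max(candidates, key=lambda c: freqs.get(c, 0))
    else:
        best = max(candidates, key=len)
    out, i = [], 0
    while i < len(words):
        j, joined = i, ""
        # greedily join following words until the candidate length is reached
        while j < len(words) and len(joined) < len(best):
            joined += words[j]
            j += 1
        if joined == best:
            out.append(best)       # merge words[i:j] into the phrase
            i = j
        else:
            out.append(words[i])
            i += 1
    return out
```

For example, with the candidate "中华人民共和国刑法", the output ["中华人民共和国", "刑法", "第一条"] is corrected to ["中华人民共和国刑法", "第一条"].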
From the above description, it can be seen that the invention achieves the following technical effects. A multi-granularity word segmentation method with multi-source information fusion is adopted, meeting the word segmentation requirements of legal scenarios. Word segmentation is needed as an aid to text understanding and cognition; the method can improve text reading efficiency and discover phrases specific to the legal field. By using the multi-source information fusion method, context and external resource information are effectively learned, improving the recognition of ambiguous words and the recall rate of out-of-vocabulary words. Meanwhile, the assistance of the user-defined vocabulary can meet users' customization needs and improve the recall rate of the word segmentation result.
The invention is directed to the legal field and learns sentence sequences with a multi-source information fusion neural network model to give word segmentation results. The multi-source information fusion method effectively extracts context information and external syntactic information, alleviating the ambiguity problem and the recognition problem of out-of-vocabulary words and improving multi-granularity word segmentation precision. A multi-granularity joint learning method shares the representation of the underlying character sequence, realizing information sharing among segmentation corpora of different granularities. The assistance of a user-defined vocabulary can meet users' customization needs and improve the recall rate of the word segmentation result. With multi-granularity word segmentation based on multi-source information fusion for the legal field, a segmentation scheme can be selected independently according to task requirements, facilitating the development of related language understanding tasks in the judicial field.
Word-level legal tasks all depend on the word segmentation result, for example event extraction, entity recognition and semantic abstraction tasks. The multi-granularity word segmentation model oriented to legal text understanding can reduce the propagation of segmentation errors and improve the effect of downstream tasks. Experimental results show that joint learning of word segmentation and entity recognition increases the F1 value of entity recognition by 2%. For example, in the entity recognition task, for the sentence "How is the compensation amount calculated for a grade-nine disability in a traffic accident?", coarse-grained segmentation can identify the two phrases "traffic accident" and "grade-nine disability", which correspond directly to entity labels, so the entities can be located well and their types distinguished.
Compared with the prior art: (1) the method adopts a multi-granularity word segmentation method and can meet the word segmentation requirements of legal scenarios.
(2) Aimed at the understanding and cognition of legal texts, the method can discover phrases specific to the legal field, effectively learn context and external resource information through multi-source information fusion, and improve text reading efficiency through multi-granularity word segmentation.
(3) The assistance of the user-defined vocabulary can meet users' customization needs, improving the recall rate of the word segmentation result as well as the recognition of ambiguous words and the recall rate of out-of-vocabulary words.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
According to the second aspect of the present application, there is also provided a multi-source information fused word segmentation apparatus, as shown in fig. 4, the apparatus includes:
the characteristic vector generating module is used for generating a unitary information characteristic vector, a binary information characteristic vector and a dependency syntax information characteristic vector of the sentence to be identified;
the fusion output module is used for fusing the unary information characteristic vector, the binary information characteristic vector and the dependency syntax information characteristic vector in a fusion layer of the multi-source fusion model; and the output layer of the multi-source fusion model outputs the word segmentation result of the sentence to be recognized.
According to a third aspect of the present application, there is also provided an electronic device comprising at least one processor and at least one memory; the memory is to store one or more program instructions; the processor is configured to execute one or more program instructions to perform any of the methods described above.
In a fourth aspect, the present application also proposes a computer-readable storage medium having embodied therein one or more program instructions for executing the method of any one of the above.
The various methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The processor reads the information in the storage medium and completes the steps of the method in combination with the hardware.
The storage medium may be, for example, a memory; the memory may be volatile memory, nonvolatile memory, or a combination of both.
The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or flash memory.
The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM).
The storage media described in connection with the embodiments of the invention are intended to comprise, without being limited to, these and any other suitable types of memory.
Those skilled in the art will appreciate that, in one or more of the examples described above, the functions described in the present invention may be implemented by a combination of hardware and software. When implemented in software, the corresponding functions may be stored on, or transmitted as one or more instructions or code over, a computer-readable medium. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
The above description is only a preferred embodiment of the present application and is not intended to limit it; those skilled in the art may make various modifications and changes. Any modification, equivalent replacement or improvement made within the spirit and principles of the present application shall fall within its protection scope.

Claims (10)

1. A word segmentation method for multi-source information fusion, characterized by comprising the following steps:
generating a unigram information feature vector, a bigram information feature vector and a dependency syntax information feature vector of a sentence to be recognized;
fusing, in a fusion layer of a multi-source fusion model, the unigram information feature vector, the bigram information feature vector and the dependency syntax information feature vector;
and outputting, by an output layer of the multi-source fusion model, the word segmentation result of the sentence to be recognized.
2. The word segmentation method for multi-source information fusion according to claim 1, wherein generating the unigram information feature vector of the sentence to be recognized comprises:
generating a unigram character sequence of the sentence to be recognized;
setting a word segmentation granularity label for the sentence to be recognized;
and encoding the sentence to be recognized with a BERT model according to the word segmentation granularity label, to obtain the unigram information feature vector of the sentence to be recognized.
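The first two steps of this claim can be sketched as follows: the unigram (single-character) sequence, plus the granularity label prepended as a special token in the style of BERT's [CLS]. The label vocabulary ([FINE]/[COARSE]) and the prepend convention are assumptions, not from the patent:

```python
def unigram_sequence(sentence):
    # A Chinese sentence's unigram sequence is simply its character list.
    return list(sentence)

def with_granularity_label(chars, granularity):
    # Hypothetical: encode the segmentation-granularity label as a
    # special leading token, analogous to BERT's [CLS] token.
    return ["[" + granularity.upper() + "]"] + chars

seq = with_granularity_label(unigram_sequence("多源信息融合"), "coarse")
print(seq)  # ['[COARSE]', '多', '源', '信', '息', '融', '合']
```

The resulting token sequence would then be fed to the BERT encoder to yield the unigram information feature vector.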
3. The word segmentation method for multi-source information fusion according to claim 1, wherein generating the bigram information feature vector of the sentence to be recognized comprises:
generating a bigram character sequence of the sentence to be recognized;
and querying a static word vector table to obtain the bigram information feature vector of the bigram character sequence.
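A minimal sketch of this claim's two steps is shown below. The end-of-sentence padding and the UNK fallback for out-of-vocabulary bigrams are illustrative conventions, and the toy vector table stands in for a real pretrained, frozen embedding table:

```python
def bigram_sequence(sentence):
    # Adjacent character bigrams; the final position is padded with an
    # end marker so the bigram sequence aligns with the unigram sequence.
    return [sentence[i:i + 2] for i in range(len(sentence) - 1)] + [sentence[-1] + "</s>"]

# Hypothetical static bigram-vector table (e.g. pretrained and frozen);
# out-of-vocabulary bigrams fall back to a shared UNK vector.
bigram_table = {"多源": [0.1, 0.2], "源信": [0.3, 0.1]}
unk_vector = [0.0, 0.0]

bigrams = bigram_sequence("多源信息")
vectors = [bigram_table.get(bg, unk_vector) for bg in bigrams]
print(bigrams)  # ['多源', '源信', '信息', '息</s>']
```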
4. The word segmentation method for multi-source information fusion according to claim 1, wherein generating the dependency syntax information feature vector of the sentence to be recognized comprises:
for each character, acquiring a context feature set and a syntactic feature set of the character;
encoding the context feature set and the syntactic feature set to obtain a context feature vector and a syntactic feature vector;
obtaining the dependency syntax information feature vector of the character from the syntactic feature vector and the context feature vector;
and summing the dependency syntax information feature vectors of the characters to obtain the dependency syntax information feature vector of the sentence to be recognized.
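The per-character combination and final summation of this claim can be sketched as follows; elementwise addition for combining the two encoded vectors is an illustrative choice, since the claim does not fix the operator, and the toy integer vectors are stand-ins for real encoder outputs:

```python
def char_dep_vector(context_vec, syntax_vec):
    # Combine one character's encoded context-feature vector and
    # syntactic-feature vector (elementwise addition as a stand-in).
    return [c + s for c, s in zip(context_vec, syntax_vec)]

def sentence_dep_vector(per_char_vectors):
    # Sum the per-character vectors to obtain the sentence-level
    # dependency syntax information feature vector.
    return [sum(dim) for dim in zip(*per_char_vectors)]

# Two characters with toy (integer) feature vectors.
encoded = [([1, 2], [0, 1]),   # (context_vec, syntax_vec) of char 1
           ([2, 1], [1, 0])]   # (context_vec, syntax_vec) of char 2
per_char = [char_dep_vector(c, s) for c, s in encoded]
sent_vec = sentence_dep_vector(per_char)
print(sent_vec)  # [4, 4]
```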
5. The word segmentation method for multi-source information fusion according to claim 1, wherein after the output layer of the multi-source fusion model outputs the word segmentation result of the sentence to be recognized, the method further comprises:
correcting the word segmentation result by using a preset custom word list.
6. The word segmentation method for multi-source information fusion according to claim 5, wherein correcting the word segmentation result by using the preset custom word list comprises:
judging whether the custom word list contains the word segmentation result; if yes, judging whether a subset of the word segmentation result, or a longer phrase containing the word segmentation result, exists in the custom word list;
if yes, reading the subset of the word segmentation result, or the longer phrase containing the word segmentation result, from the custom word list as a candidate set;
and replacing the related words in the word segmentation result according to the candidate set.
7. The word segmentation method for multi-source information fusion according to claim 6, wherein replacing the related words in the word segmentation result according to the candidate set comprises:
judging whether the custom word list contains word frequencies;
if yes, determining the word with the highest word frequency in the candidate set;
determining, in the word segmentation result, the word related to the word with the highest word frequency in the candidate set;
and replacing, in the word segmentation result, the related word with the word with the highest word frequency to obtain the corrected word segmentation result.
8. The word segmentation method for multi-source information fusion according to claim 6, wherein, if the custom word list does not contain word frequencies, the method further comprises:
determining the phrase with the longest word length in the candidate set;
determining, in the word segmentation result, the word related to the phrase with the longest word length;
and replacing, in the word segmentation result, the related word with the phrase with the longest word length to obtain the corrected word segmentation result.
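Claims 5–8 together describe a dictionary-based post-correction. A much-simplified sketch of the selection rule (highest frequency when frequencies exist, otherwise longest phrase) is given below; the candidate generation and the handling of "related words" are illustrative reductions of the claimed procedure, and merging of adjacent tokens is omitted:

```python
def correct_segmentation(tokens, custom_vocab, word_freq=None):
    # For each token found in the custom word list, gather candidate
    # entries that are substrings of it or longer phrases containing it,
    # then replace the token with the highest-frequency candidate, or
    # with the longest candidate when no frequencies are provided.
    corrected = list(tokens)
    for i, tok in enumerate(tokens):
        if tok not in custom_vocab:
            continue
        candidates = [w for w in custom_vocab
                      if w != tok and (w in tok or tok in w)]
        if not candidates:
            continue
        if word_freq:
            best = max(candidates, key=lambda w: word_freq.get(w, 0))
        else:
            best = max(candidates, key=len)
        corrected[i] = best
    return corrected

print(correct_segmentation(["机器", "翻译"], {"机器", "机器学习"}))
# ['机器学习', '翻译']
```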
9. A word segmentation apparatus for multi-source information fusion, characterized by comprising:
a feature vector generation module, configured to generate a unigram information feature vector, a bigram information feature vector and a dependency syntax information feature vector of a sentence to be recognized;
and a fusion output module, configured to fuse the unigram information feature vector, the bigram information feature vector and the dependency syntax information feature vector in a fusion layer of a multi-source fusion model, and to output, through an output layer of the multi-source fusion model, the word segmentation result of the sentence to be recognized.
10. An electronic device, characterized by comprising: at least one processor and at least one memory; the memory is configured to store one or more program instructions; and the processor is configured to execute the one or more program instructions to perform the method of any one of claims 1-8.
CN202110776250.4A 2021-07-08 2021-07-08 Word segmentation method, device and equipment for multi-source information fusion Pending CN113505828A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110776250.4A CN113505828A (en) 2021-07-08 2021-07-08 Word segmentation method, device and equipment for multi-source information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110776250.4A CN113505828A (en) 2021-07-08 2021-07-08 Word segmentation method, device and equipment for multi-source information fusion

Publications (1)

Publication Number Publication Date
CN113505828A true CN113505828A (en) 2021-10-15

Family

ID=78012370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110776250.4A Pending CN113505828A (en) 2021-07-08 2021-07-08 Word segmentation method, device and equipment for multi-source information fusion

Country Status (1)

Country Link
CN (1) CN113505828A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114386407A (en) * 2021-12-23 2022-04-22 北京金堤科技有限公司 Word segmentation method and device for text

Similar Documents

Publication Publication Date Title
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN108920473A (en) A kind of data enhancing machine translation method based on similar word and synonym replacement
CN110674646A (en) Mongolian Chinese machine translation system based on byte pair encoding technology
CN111160031A (en) Social media named entity identification method based on affix perception
CN109522403A (en) A kind of summary texts generation method based on fusion coding
CN111767718A (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN113190656A (en) Chinese named entity extraction method based on multi-label framework and fusion features
CN116502628A (en) Multi-stage fusion text error correction method for government affair field based on knowledge graph
CN111814477B (en) Dispute focus discovery method and device based on dispute focus entity and terminal
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN114333838A (en) Method and system for correcting voice recognition text
CN113505828A (en) Word segmentation method, device and equipment for multi-source information fusion
CN114154504A (en) Chinese named entity recognition algorithm based on multi-information enhancement
CN112199952B (en) Word segmentation method, multi-mode word segmentation model and system
CN113505592A (en) Multi-granularity fused word segmentation method, device, equipment and storage medium
CN111368531B (en) Translation text processing method and device, computer equipment and storage medium
CN111563534B (en) Task-oriented word embedding vector fusion method based on self-encoder
CN112818698A (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN113360601A (en) PGN-GAN text abstract model fusing topics
CN113761883A (en) Text information identification method and device, electronic equipment and storage medium
CN116991875A (en) SQL sentence generation and alias mapping method and device based on big model
CN116842944A (en) Entity relation extraction method and device based on word enhancement
CN116187304A (en) Automatic text error correction algorithm and system based on improved BERT
CN116681061A (en) English grammar correction technology based on multitask learning and attention mechanism
CN114970537B (en) Cross-border ethnic cultural entity relation extraction method and device based on multi-layer labeling strategy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination