CN113033196B

CN113033196B - Word segmentation method, device, equipment and storage medium

Info

Publication number: CN113033196B
Application number: CN202110298222.6A
Authority: CN
Inventors: 李�浩; 庞敏辉; 赵志新; 冯婧超
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-03-19
Filing date: 2021-03-19
Publication date: 2023-08-15
Anticipated expiration: 2041-03-19
Also published as: CN113033196A

Abstract

The application discloses a word segmentation method, a word segmentation device, word segmentation equipment and a word segmentation storage medium, relates to the technical field of computers, and further relates to artificial intelligence technologies such as deep learning, natural language processing and cloud computing. The specific implementation scheme is as follows: constructing a directed graph of an original sentence by taking words in the original sentence as vertexes; the vertexes in the directed graph have self-acting side relations with the vertexes and bi-directional adjacent side relations with adjacent vertexes; determining vertex characteristic representations of vertices in the directed graph and edge relation characteristic representations of edge relations; and according to the vertex characteristic representation of the vertex in the directed graph and the edge relation characteristic representation of the edge relation, the original sentence is segmented. The application can improve the accuracy of word segmentation results and provides a new idea for word segmentation of sentences.

Description

Word segmentation method, device, equipment and storage medium

Technical Field

The application relates to the technical field of artificial intelligence, in particular to deep learning, natural language processing, cloud computing and other technologies.

Background

The word segmentation is an upstream task of tasks such as natural language processing, search engines, intelligent recommendation and the like, and if the word segmentation result is wrong, the accuracy, recall rate and the like of the downstream task can be greatly reduced. Therefore, it is important to improve the accuracy of the word segmentation result.

Disclosure of Invention

The application provides a word segmentation method, a word segmentation device, word segmentation equipment and a storage medium.

According to an aspect of the present application, there is provided a word segmentation method, including:

constructing a directed graph of an original sentence by taking words in the original sentence as vertexes; the vertexes in the directed graph have self-acting side relations with the vertexes and bi-directional adjacent side relations with adjacent vertexes;

determining vertex characteristic representations of vertices in the directed graph and edge relation characteristic representations of edge relations;

and according to the vertex characteristic representation of the vertex in the directed graph and the edge relation characteristic representation of the edge relation, the original sentence is segmented.

According to another aspect of the present application, there is provided a word segmentation apparatus including:

the directed graph construction module is used for constructing a directed graph of an original sentence by taking words in the original sentence as vertexes; the vertexes in the directed graph have self-acting side relations with the vertexes and bi-directional adjacent side relations with adjacent vertexes;

the feature representation determining module is used for determining vertex feature representations of vertexes in the directed graph and edge relation feature representations of edge relations;

and the word segmentation module is used for segmenting the original sentence according to the vertex characteristic representation of the vertex in the directed graph and the edge relation characteristic representation of the edge relation.

According to another aspect of the present application, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the word segmentation method according to any one of the embodiments of the present application.

According to another aspect of the present application, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the word segmentation method according to any one of the embodiments of the present application.

According to another aspect of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the word segmentation method according to any of the embodiments of the present application.

According to the technology provided by the application, the accuracy of the word segmentation result can be improved, and a new thought is provided for word segmentation of sentences.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.

Drawings

The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application. Wherein:

FIG. 1A is a flow chart of a word segmentation method provided according to an embodiment of the present application;

FIG. 1B is a schematic diagram of a directed graph of an original sentence provided in accordance with an embodiment of the present application;

FIG. 2 is a flow chart of another word segmentation method provided in accordance with an embodiment of the present application;

FIG. 3 is a flow chart of yet another word segmentation method provided in accordance with an embodiment of the present application;

FIG. 4 is a flow chart of yet another word segmentation method provided in accordance with an embodiment of the present application;

FIG. 5 is a schematic diagram of a method for implementing word segmentation based on a model according to an embodiment of the present application;

fig. 6 is a schematic structural diagram of a word segmentation device according to an embodiment of the present application;

fig. 7 is a block diagram of an electronic device for implementing a word segmentation method according to an embodiment of the present application.

Detailed Description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Fig. 1 is a flowchart of a word segmentation method according to an embodiment of the present application. The embodiment of the application is suitable for the situation of how to word the sentence, in particular for the situation of word segmentation of the pair Wen Yugou. The embodiment may be performed by a word segmentation apparatus configured in an electronic device, which may be implemented in software and/or hardware. As shown in fig. 1, the word segmentation method includes:

s101, constructing a directed graph of the original sentence by taking words in the original sentence as vertexes.

In this embodiment, the original sentence is a sentence that needs word segmentation. Alternatively, the original sentence may be a sentence directly obtained, which is input by the user in the form of text; the term may be a sentence obtained by converting an utterance input by a user in a speech form into a text form.

Alternatively, after the original sentence of the user is obtained, a directed graph of the original sentence may be constructed. Specifically, according to the sequence of words in the original sentence, the words in the original sentence are used as vertexes, adjacent vertexes are linked by a bi-directional adjacent side relationship, each vertex has a self-acting side relationship with itself, and a directed graph of the original sentence is obtained. In this embodiment, for each vertex, the vertex has a bi-directional adjacent edge relationship with an adjacent vertex, specifically including an adjacent edge relationship in which the vertex points to its adjacent vertex, and an adjacent edge relationship in which the adjacent vertex points to the vertex.

For example, the original sentence is "help me charge", each word in the original sentence is used as a vertex, each vertex is linked with its adjacent vertex by a two-way adjacent side relationship, for example, the vertex "help" is linked with the vertex "me" by a two-way adjacent side relationship; moreover, each vertex has a self-acting side relationship with itself, for example, the vertex 'upper' has a self-acting side relationship with itself, and thus a directed graph of 'upper me charging fee' can be obtained as shown in fig. 1B.

S102, determining vertex characteristic representation of vertices in the directed graph and edge relation characteristic representation of edge relation.

In this embodiment, the so-called vertex feature represents a feature capable of characterizing a word associated with a vertex, which may be represented in the form of a vector or a matrix. The side relation comprises a self-acting side relation and a bi-directional adjacent side relation, the side relation characteristic representations corresponding to different side relations can be different, and the side relation characteristic representation of the self-acting side relation is used for representing the characteristic of the self-correlation relation between the vertex and the self-acting side relation and can be represented in a vector or matrix form; the edge-related features of bi-directional adjacent edge relationships represent features that characterize the cross-correlation relationship between a vertex and its adjacent vertices, which may also be represented in the form of vectors or matrices.

Specifically, after the directed graph of the original sentence is constructed, a preset feature extraction algorithm or feature extraction model may be adopted to determine the vertex feature representation of each vertex and the edge relation feature representation of each edge relation in the directed graph. For example, a feature extraction model may be trained in advance, and then after the directed graph of the original sentence is constructed, the directed graph of the original sentence may be input into the feature extraction model, so that the vertex feature representation of each vertex in the directed graph and the edge relationship feature representation of each edge relationship may be directly obtained.

S103, dividing the original sentence according to the vertex characteristic representation of the vertex and the edge relation characteristic representation of the edge relation in the directed graph.

Optionally, after obtaining the vertex feature representation of each vertex and the edge relation feature representation of each edge relation in the directed graph, the vertex feature representations of all vertices and the edge relation feature representations of all edge relations may be input into a pre-trained word segmentation model, so as to obtain a word segmentation result of the original sentence.

According to the technical scheme provided by the embodiment of the application, the original sentence can be segmented by constructing the directed graph of the original sentence and according to the vertex characteristic representation of the vertices and the edge relation characteristic representation of the edge relation in the directed graph of the original sentence. According to the technical scheme, the characteristics of the words in the original sentence and the characteristics between the words can be fully expressed by introducing the directed graph, so that the word segmentation result of the original sentence has higher accuracy, and a foundation is laid for enabling the downstream task to have higher recall rate and accuracy.

If the characters in the original sentence are complex and inconsistent, the final word segmentation result may be inaccurate. Therefore, in order to further improve accuracy of the word segmentation result, as an alternative manner of the embodiment of the present application, before constructing the directed graph of the original sentence by using the word in the original sentence as the vertex, the method may further include: and normalizing the original sentence.

Optionally, the original sentence is normalized under the condition that the phenomenon of non-standardization exists in the original sentence. In this embodiment, the phenomenon of non-standardization may include, but is not limited to, a simplified font, a non-uniform number usage, and the like. Specifically, according to the complex comparison table, the complex characters in the original sentence are converted into simplified characters; and using unified criteria for digits in the original sentence.

It can be appreciated that, before the word segmentation processing is performed on the original sentence, the advantage of performing normalization processing on the original sentence is that error influence caused by non-normalization of the original sentence can be avoided.

Fig. 2 is a flowchart of another word segmentation method according to an embodiment of the present application. The embodiment of the application further explains how to determine the vertex characteristic representation of the vertex and the edge relation characteristic representation of the edge relation in the directed graph on the basis of the embodiment. As shown in fig. 2, the word segmentation method includes:

S201, constructing a directed graph of the original sentence by taking words in the original sentence as vertexes.

Wherein, the vertexes in the directed graph have a self-acting edge relationship with themselves and a bi-directional adjacent edge relationship with adjacent vertexes.

S202, determining vertex characteristic representation of the target vertex according to the occurrence frequency of the labels of the words associated with the target vertex in the directed graph in the sample sentence set.

In this embodiment, the sample sentence set includes a large number of sample sentences, which are sentences labeled with labels of words in advance, that is, each word in the sample sentence is labeled. Further, the label of the word may include three types B, I and O, wherein O is used to represent a word of a single word, a word at a start position of one word is denoted by B, and a word at an intermediate position and an end position of one word is denoted by I. For example, the sample sentence is "energy [ B ] not [ I ] energy [ I ] side [ O ] I [ O ] fill [ O ] and [ B ] Fischer [ I ]".

In this embodiment, each vertex in the directed graph may be sequentially used as the target vertex. For each target vertex, the process of S203 through S204 may be performed to determine a vertex feature representation of the target vertex, and an edge relationship feature representation of the edge relationship associated with the target vertex.

Alternatively, for each target vertex in the directed graph, the word associated with the target vertex may be queried from the sample sentence set, and the occurrence frequency of each label of the word associated with the target vertex may be counted, and then, according to the set label order, the counted occurrence frequencies of all labels of the word associated with the target vertex may be used as the vertex feature representation of the target vertex. For example, the original sentence is "group I charge", assuming that the target vertex in the directed graph of the original sentence is "group", the number of occurrences of "group" tag B in the sample sentence set is 50, the number of occurrences of "group" tag is O is 20, and the number of occurrences of "group" tag is I is 10, and the set tag order is BOI, and further the vertex feature representation of "group" may be [50, 20, 10].

Further, after the occurrence frequency of each label of the word associated with the target vertex is obtained, the occurrence probability of each label of the word associated with the target vertex can be determined according to the occurrence frequency of each label of the word associated with the target vertex; furthermore, the probability of occurrence of all labels of the word associated with the target vertex may be expressed as the vertex characteristics of the target vertex in accordance with the set label order. For example, it may be determined that the occurrence probability of the "help" tag B in the sample sentence set is 50/(50+20+10), i.e., 5/8; the same way can determine that the occurrence probability of the 'upper' label in the sample sentence set is 1/4, and the occurrence probability of the 'upper' label in the sample sentence set is 1/8; furthermore, the vertex characteristic representation of the target vertex "upper" may be [5/8,1/4,1/8].

In addition, the labels of the words in the sample sentence set can be subjected to statistical analysis in advance, and a characteristic representation dictionary can be constructed; and then after constructing the directed graph of the original sentence, directly inquiring words associated with the target vertexes as indexes, and inquiring from the characteristic representation dictionary to obtain the vertex characteristic representations of the target vertexes. Alternatively, a feature representation model may be trained in advance based on the sample sentence set, and then, after constructing the directed graph of the original sentence, the word associated with the target vertex may be input into the feature representation model, and the vertex feature representation of the target vertex may be output.

S203, determining edge relation characteristic representation of adjacent edge relation of adjacent vertexes pointing to the target vertexes according to the label occurrence frequency of words associated with the target vertexes in the sample sentence set under the condition that the adjacent vertexes point to the target vertexes.

In this embodiment, for each target vertex in the directed graph, the target vertex has an adjacent edge relationship pointing to its adjacent vertex, and the adjacent vertex of the target vertex points to the adjacent edge relationship of the target vertex.

Alternatively, for each target vertex, the edge-related feature representation of the adjacent edge relationship that the target vertex's adjacent vertex points to the target vertex may be determined by: obtaining a sample sentence in which the word associated with the target vertex and the word associated with the adjacent vertex of the target vertex jointly appear from a sample sentence set as a candidate sample sentence; inquiring sample sentences of which adjacent vertexes point to the target vertexes from the obtained candidate sample sentences to serve as target sample sentences; counting the occurrence frequency of each label of the word associated with the target vertex in the target sample sentence, and taking the counted occurrence frequency of all labels of the word associated with the target vertex as the edge relation characteristic representation of the adjacent edge relation of the adjacent vertex pointing to the target vertex according to the set label sequence.

For example, the target vertex is "upper" and the adjacent vertex to the target vertex is "me". Further, sample sentences which are commonly appeared in the 'side' and the 'me' can be obtained from the sample sentence set to serve as candidate sample sentences, and sentences from the 'me' to the 'side' in the obtained candidate sample sentences are taken as target sample sentences; and then, counting the occurrence frequency of each label of the 'side' in the target sample sentence, and taking the counted occurrence frequency of all labels of the 'side' as the side relation characteristic representation of the adjacent side relation of the 'I' pointing to the 'side' according to the set label sequence.

Further, after the occurrence frequency of each label of the word associated with the target vertex is obtained, the occurrence probability of each label of the word associated with the target vertex can be determined according to the occurrence frequency of each label of the word associated with the target vertex; furthermore, the occurrence probability of all labels of the word associated with the target vertex can be used as the edge relation characteristic representation of the adjacent edge relation of the adjacent vertex pointing to the target vertex according to the set label sequence.

Alternatively, for the edge relationship feature that the target vertex points to the adjacent edge relationship with its adjacent vertex, S203 may be used for calculation with the adjacent vertex of the target vertex as the target vertex; or may be calculated under the target vertex. Specifically, when the target vertex points to the adjacent vertex of the target vertex, determining the edge relation characteristic representation of the adjacent edge relation of the target vertex points to the adjacent vertex according to the label occurrence frequency of the word associated with the adjacent vertex of the target vertex in the sample statement set.

S204, determining the edge relation characteristic representation of the self-acting edge relation of the target vertex according to the preset vector.

It should be noted that, the self-acting edge relationship of each target vertex is used to represent the self-correlation relationship of the vertex and itself, and there is no difference in function angle, so that for the sake of calculation, the edge relationship characteristic representation of the self-acting edge relationship of all the target vertices is set to be the same in this embodiment.

For example, the preset vector may be represented as an edge relationship feature of the self-acting edge relationship of each target vertex. The dimension of the preset vector is the same as the dimension of the vertex feature representation, and may specifically be a three-dimensional row vector, for example [1, 1].

It should be noted that the operation of S205 may be performed after the vertex characteristic representation of each vertex and the edge relationship characteristic representation of each edge relationship in the directed graph of the original sentence are determined.

S205, the original sentence is segmented according to the vertex characteristic representation of the vertices in the directed graph and the edge relation characteristic representation of the edge relation.

It should be noted that, in the word segmentation method based on deep learning at present, a word vector is generally determined by embedding (i.e. enabling) a word in a corpus to obtain a word vector with a longer dimension (e.g. 200); the word vector dimension is longer, so that a model is complex, the performance is poor, and the like.

In the embodiment, the label of the word is labeled for the sample sentence in advance by introducing the label 'BOI', and the vertex characteristic representation and the side relation characteristic representation are obtained in a statistical mode.

According to the technical scheme, the vertex characteristic representation of the target vertex can be determined based on the labels of the characters in the sample sentence set by adopting a statistical method, compared with the conventional character vector adopted in the prior art, the dimension of the characteristic representation is reduced, and even if deep learning is used for word segmentation, the accuracy can be ensured, the performance of a model can be improved, and the like, so that a new thought is provided for determining the vertex characteristic representation of the vertex and the edge relation characteristic representation of the edge relation.

Fig. 3 is a flowchart of yet another word segmentation method according to an embodiment of the present application. On the basis of the embodiment, the embodiment of the application further explains how to segment the original sentence according to the vertex characteristic representation of the vertex and the edge relation characteristic representation of the edge relation in the directed graph. As shown in fig. 3, the word segmentation method includes:

S301, constructing a directed graph of the original sentence by taking words in the original sentence as vertexes.

S302, determining vertex characteristic representation of vertices in the directed graph and edge relation characteristic representation of edge relation.

S303, determining labels of words in the original sentence according to vertex characteristic representations of vertices in the directed graph and edge relation characteristic representations of edge relations.

Optionally, in the case that the vertex feature representation of the vertex in the directed graph and the edge relationship feature representation of the edge relationship are determined based on the label occurrence frequency of the words in the sample sentence set, the vertex feature representation of the vertex in the directed graph and the edge relationship feature representation of the edge relationship may be input into a pre-trained label determination model, and the probability that each word in the original sentence belongs to a different label (i.e., the label of each word is B, O, I respectively) may be output; and further, for each word in the original sentence, determining the label of the word according to the probability that the word belongs to different labels. For example, a label corresponding to the highest probability among probabilities that the word belongs to different labels may be used as the label of the word. For example, the probabilities of the "upper" tag being B, O, I are 0.3, 0.6 and 0.1, respectively, and O can be used as the "upper" tag.

Further, in the case where the vertex characteristic representation of the vertex and the edge relation characteristic representation of the edge relation are determined based on the label occurrence frequency of the words in the sample sentence set in step S302, the determined vertex characteristic representation of the vertex may be regarded as the initial vertex characteristic representation of the vertex.

Furthermore, to improve the accuracy of the word segmentation result, the vertex feature representation of the vertex determined in step S302 may be updated first, and then, according to the updated vertex feature representation of the vertex, the label of the word in the original sentence may be determined. Optionally, each vertex in the directed graph may be sequentially used as a target vertex, and for each target vertex, the vertex feature representation of the target vertex may be updated according to the vertex feature representation of the target vertex, the vertex feature representations of adjacent vertices of the target vertex, and the edge relationship feature representations of the edge relationship associated with the target vertex in the directed graph. The edge relation associated with the target vertex comprises a self-acting edge relation of the target vertex and an adjacent edge relation that an adjacent vertex of the target vertex points to the target vertex.

S304, word segmentation is carried out on the original sentence according to the labels of the words in the original sentence.

Optionally, after determining the label of each word in the original sentence, the original sentence may be segmented according to the label of the word in the original sentence based on the meaning represented by the label. For example, the original sentence is "group me charge", by executing steps S301 to S303, the label of "group me" is O, the label of "charge" is O, the label of "phone" is B, and the label of "charge" is I, and further, based on the meaning represented by the label, the original sentence can be segmented, and the result of the segmentation is that the "group", "me" and "charge" are words of a single word, and the "charge" is a word.

It should be noted that, in this embodiment, a label of a word in an original sentence is introduced, and word segmentation of the original sentence is implemented based on the label, so that an optional way is provided for implementing word segmentation of the original sentence.

According to the technical scheme provided by the embodiment of the application, the labels of the words in the original sentence can be accurately determined according to the vertex characteristic representation of the vertices in the directed graph of the original sentence and the edge relation characteristic representation of the edge relation, so that word segmentation of the original sentence can be realized based on the labels of the words in the original sentence, and the accuracy of word segmentation results is improved.

Fig. 4 is a flowchart of still another word segmentation method according to an embodiment of the present application. On the basis of the embodiment, the embodiment of the application further explains how to segment the original sentence according to the vertex characteristic representation of the vertex and the edge relation characteristic representation of the edge relation in the directed graph. As shown in fig. 4, the word segmentation method includes:

s401, constructing a directed graph of the original sentence by taking words in the original sentence as vertexes.

S402, determining vertex characteristic representations of vertices in the directed graph and edge relation characteristic representations of edge relations.

S403, updating the vertex characteristic representation of the target vertex according to the vertex characteristic representation of the target vertex, the vertex characteristic representations of the adjacent vertexes of the target vertex and the side relation characteristic representation of the side relation associated with the target vertex in the directed graph.

In this embodiment, the vertex feature representation of the vertex and the edge relationship feature representation of the edge relationship in step S402 are preferably determined based on the label occurrence frequency of the words in the sample sentence set. To further improve the accuracy of the word segmentation result, each vertex in the directed graph of the present embodiment may be used as a target vertex, and for each target vertex, the process of S403 may be performed to update the vertex feature representation of the target vertex. Further, after the vertex feature representation of each target vertex in the directed graph is updated, step S404 may be performed.

Optionally, for each target vertex i in the directed graph, the own feature information is sent to other surrounding vertices j along the adjacent side relationship, so that the vertex feature representation of the target vertex can be updated according to the feature information received by the target vertex from other surrounding vertices, the feature information of the own vertex, and the like. Specifically, fn may be expressed according to the vertex characteristics of the target vertex i _i ^(l) Vertex characteristics of adjacent vertex j of target vertex i represent fn _j ^(l) Adjacent vertex j of target vertex i points to adjacent edge relation e of target vertex i _ji Is expressed as fe _ji ^(l) Self-acting edge relationship e of target vertex i _ii Is expressed as fe _ii ^(l) To update the vertex feature representation of the target vertex i. Where l represents the dimension of the feature representation, in this embodiment, depending on the type of tag, for example, l has a value of 3.

For each target vertex i, the vertex feature representation of that target vertex may be updated, for example, by:

and step A, determining the self-transmission characteristic of the target vertex according to the vertex characteristic representation of the target vertex and the edge relation characteristic representation of the self-acting edge relation of the target vertex.

In this embodiment, the self-transmission feature is a feature that the target vertex itself transmits to itself. Specifically, the vertex characteristics of the target vertex i may be expressed as fn _i ^(l) Self-acting edge relationship e with target vertex i _ii Is expressed as fe _ii ^(l) And adding, and taking the added result as the self-transmission characteristic of the target vertex i. Alternatively, the vertex characteristics of the target vertex i may be expressed as fn _i ^(l) Vertex weight wn associated with target vertex i _i ^(l) Multiplying and relating the self-acting edges e of the target vertices i _ii Is expressed as fe _ii ^(l) Side relationship weights we associated with self-acting side relationships _ii ^(l) Multiply and add the multiplication results of the two, i.e. wn _i ^(l) *fn _i ^(l) +we _ii ^(l) *fe _ii ^(l) As a self-propagation feature of the target vertex i. The vertex weights and the side relationship weights can be obtained by training based on sample sentences in a sample sentence set in advance, further, the side relationship weights of the self-acting side relationships of each vertex can be the same, the vertex weights of each vertex can be the same or different, and the side relationship weights of the two-way adjacent side relationships of each vertex can be different or different.

And B, determining the transmission characteristics of the adjacent vertexes to the target vertexes according to the vertex characteristic representation of the adjacent vertexes and the edge relation characteristic representation of the adjacent edge relation of the adjacent vertexes pointing to the target vertexes.

In this embodiment, the transfer feature is a feature transferred from an adjacent vertex of the target vertex to the target vertex. Specifically, the vertex characteristics of the adjacent vertex j of the target vertex i may be expressed as fn _j ^(l) Adjacent edge relationship e with adjacent vertex j of target vertex i pointing to target vertex i _ji Is expressed as fe _ji ^(l) Adding and orienting the added result as the adjacent vertex j of the target vertex i The transfer characteristics transmitted by the target vertex i. Alternatively, the vertex characteristics of the adjacent vertex j of the target vertex i may be expressed as fn _j ^(l) Vertex weights wn associated with the neighboring vertices j _j ^(l) Multiplying and directing adjacent vertex j of target vertex i to adjacent edge relationship e of target vertex i _ji Is expressed as fe _ji ^(l) Side relationship weights we associated with the adjacent side relationship _ji ^(l) Multiply and add the multiplication results of the two, i.e. wn _j ^(l) *fn _j ^(l) +we _ji ^(l) *fe _ji ^(l) And the adjacent vertex j serving as the target vertex i transmits the feature to the target vertex i.

For example, the original sentence is "help-me charge", and if the target vertex is "me", the adjacent vertices of the target vertex include vertex "help" and vertex "charge". Multiplying the vertex characteristic representation of the vertex 'upper' by the vertex weight, multiplying the side relation characteristic representation of the adjacent side relation of the vertex 'upper' pointing to the vertex 'I' by the side relation weight, and adding the multiplication results of the side relation characteristic representation and the side relation weight to obtain the transmission characteristic of the vertex 'upper' to the vertex 'I'; similarly, the transfer characteristic of vertex "fill" to vertex "me" can be obtained.

And step C, updating the vertex characteristic representation of the target vertex according to the transmission characteristic, the attention coefficient associated with the transmission characteristic and the attention coefficient associated with the self-transmission characteristic.

Wherein the attention coefficient is used to characterize the importance of the information. In this embodiment, the attention coefficient associated with the transfer feature is used to characterize the importance of the transfer feature from the adjacent vertex to the target vertex, that is, the importance of the adjacent vertex to the target vertex, and optionally, the transfer features from different adjacent vertices to the target vertex may be different, so that the attention coefficient associated with the transfer feature may also be different; correspondingly, the attention coefficient associated with the self-transmission feature is used for representing the importance of the self-transmission feature of the target vertex to the self, namely representing the importance of the target vertex to the self.

Optionally, aFor each target vertex, after determining the self-passing feature of the target vertex and the transfer feature of each adjacent vertex of the target vertex to the target vertex, the self-passing feature and the attention coefficient score of the self-passing feature may be correlated _ii Multiplying, and associating each transfer feature with an attention coefficient score of the transfer feature _ji Multiplying and adding all multiplication results, i.e.(wherein N is the set of adjacent vertexes of the target vertexes), and then the updated vertex characteristic representation of the target vertexes can be obtained according to the addition result. For example, the addition result may be processed by using a sigmoid activation function, so as to obtain the vertex feature representation of the updated target vertex i.

For example, if the original sentence is "group me charge", and if the target vertex is "me", the adjacent vertices of the target vertex include a vertex group and a vertex charge, the transfer feature of the vertex group to the vertex group, the transfer feature of the vertex charge to the vertex group, and the self-transfer feature of the vertex group may be determined by executing S403; then multiplying the self-transmission characteristic of the vertex I and the associated attention coefficient thereof, multiplying the transmission characteristic of the vertex I from the upper to the vertex I and the associated attention coefficient thereof, multiplying the transmission characteristic of the vertex I from the top to the vertex I and the associated attention coefficient thereof, adding the multiplication results of the three, and processing the addition result by adopting a sigmoid activation function to obtain the updated vertex characteristic representation of the vertex I.

For each target vertex, the vertex feature representation of the vertex to which the updated target vertex relates and the variable relationship feature representation of the edge relationship are not updated. Further, in order to increase the response rate, in this embodiment, the vertex feature representations of all vertices in the directed graph may be updated in parallel.

Optionally, the process of performing steps a to C once may be completed to update the vertex characteristic representation of the target vertex once; furthermore, in order to improve word segmentation accuracy, in the embodiment, under the condition of comprehensively considering calculation complexity, response speed and the like, the accuracy of word segmentation results after three times of updating of vertex characteristic representations of target vertices is higher by training a sample sentence set.

It will be appreciated that, to ensure orderly execution, each time an update of the vertex feature representation is performed, all vertices in the directed graph are updated the next time the vertex feature representation is updated.

S404, determining the labels of the words in the original sentence according to the updated vertex characteristic representation of the target vertex.

Optionally, after obtaining the updated vertex feature representation of each target vertex, mapping the updated vertex feature representation of each target vertex with a full-connection layer, so that the full-connection layer outputs the probability that each word in the original sentence belongs to different labels, namely, a three-dimensional row vector; and further, based on the probability that each word in the original sentence belongs to different labels, the label of each word in the original sentence can be obtained. For example, for each word of the original sentence, a label corresponding to the highest probability among probabilities that the word belongs to different labels may be used as the label of the word.

S405, word segmentation is carried out on the original sentence according to the labels of the words in the original sentence.

According to the technical scheme, the vertex characteristic representation of the target vertex, the vertex characteristic representation of the adjacent vertex of the target vertex and the side relation characteristic representation of the side relation associated with the target vertex in the directed graph of the original sentence are updated, and further the accuracy of the label of the word in the original sentence is further improved based on the updated vertex characteristic representation, so that a foundation is laid for accurately segmenting the original sentence; meanwhile, based on the labels of the words in the original sentences, the accurate word segmentation of the original sentences can be realized.

Alternatively, on the basis of the above embodiment, as an alternative manner of the embodiment of the present application, the attention coefficient associated with the transmission feature and the attention coefficient associated with the self-transmission feature may be determined according to the vertex feature representation of the target vertex and the vertex feature representation of the adjacent vertex.

Specifically, for each target vertex in the directed graph, the vertex feature representation of each adjacent vertex of the target vertex may be respectively stitched with the vertex feature representation of the target vertex, and the vertex feature representation of the target vertex may be stitched with the vertex feature representation of the target vertex; then, for each stitching feature, through the leskyrelu activation function, the original attention score between the vertices corresponding to the stitching feature can be obtained. For example, for any adjacent vertex of the target vertex, the vertex feature representation of the adjacent vertex and the vertex feature representation of the target vertex may be stitched to obtain the stitching feature, and the lesky relu activation function may be used to obtain the original attention score between the adjacent vertex and the target vertex.

Then, a softmax operation is performed on all the original attention scores to obtain the attention coefficients between the vertexes, namely the attention coefficients between the target vertexes (namely the attention coefficients associated with the self-transmission characteristics) and the attention coefficients between each adjacent vertex of the target vertexes and the target vertexes (namely the attention coefficients associated with the transmission characteristics).

Further, in this embodiment, the attention coefficient may be dynamically updated along with the update of the vertex feature representation. That is, each time the vertex feature representation is updated, the attention coefficient is updated accordingly, so as to increase the accuracy of the final updated vertex feature representation, and further increase the accuracy of the word segmentation result.

It will be appreciated that the present embodiment further increases the accuracy of the word segmentation result by determining the attention coefficient based on the vertex feature representation, enabling the attention coefficient to be associated with the vertex feature representation, i.e. the attention coefficient can be varied with dynamic variation of the vertex feature representation.

On the basis of the embodiment, the embodiment of the application provides a method for realizing word segmentation based on a model, wherein a schematic diagram of the model structure is shown in fig. 5. The specific word segmentation process is as follows:

The GAT model is first trained. And distributing the data in the sample statement set according to a set proportion (such as 8:1:1) to obtain a training set, a verification set and a test set.

For each sample sentence in the training set, constructing a directed graph of the sample sentence, and determining vertex characteristic representation of each vertex in the sample sentence and edge relation characteristic representation of an edge relation based on a statistical method; the graph annotation force network (Graph Attention Networks, GAT) is then trained using the vertex feature representation and the side relationship feature representation of each sample sentence in the training set.

In this embodiment, the algorithms in the graph attention network include algorithms that update the vertex feature representation such as:related algorithms to determine the attention coefficients, etc. may also be included. Where σ represents the sigmoid activation function. Further, the training of the graph meaning force network is essentially to train the vertex weights such as wn related by the algorithm in the graph meaning force network _i ^(l) And wn _j ^(l) Side relation weights such as we _ii ^(l) And we _ji ^(l) And a process of training by determining parameters and the like involved in the attention coefficient. The method comprises the steps of performing cross entropy operation by using supervision data of sample sentences in a training set and output results of the sample sentences of a graph-note meaning network, determining loss, performing back propagation by using the loss, and training parameters in the attention network to obtain a GAT model.

Thereafter, the model may be validated and tested using the validation set and the test set, respectively, to optimize parameters in the model.

Furthermore, in the process of training the graph annotation meaning network, the accuracy of the word segmentation result obtained after three times of updating of the vertex characteristic representation is higher under the condition of comprehensively considering the calculation complexity, the response speed and the like. That is, the GAT model in this embodiment includes three GAT layers and one output layer, where the output layer is a fully connected layer.

After the original sentence is obtained, a directed graph of the original sentence can be constructed, a statistical method can be adopted, and the vertex characteristic representation of each vertex and the edge relation characteristic representation of each edge relation in the original sentence are determined based on the labels of the words in the sample sentence set; the vertex feature representations of all vertices and the edge relationship feature representations of all edge relationships may then be input to the GAT model.

In connection with fig. 5, assuming that the original sentence is "help-me charge", after the vertex feature representations of all vertices and the edge relationship feature representations of all edge relationships are input into the GAT model, each GAT layer in the GAT model updates the vertex feature representations of all vertices once. The method comprises the steps that a first GAT layer updates vertex characteristic representations of each vertex in an original sentence determined based on labels of words in a sample sentence set, and the vertex characteristics of the updated vertices are transmitted to a second GAT layer; updating the vertex characteristic representation again by the second GAT layer according to the edge relation characteristic representation of the edge relation and the vertex characteristic representation of the vertex updated by the first GAT layer, and transmitting the vertex characteristic representation of the updated vertex to a third GAT; and the third GAT layer updates the vertex characteristic representation again according to the edge relation characteristic representation of the edge relation and the vertex characteristic representation of the vertex updated by the second GAT layer, outputs the vertex characteristic representation of the updated vertex to the output layer, and outputs the probability of each word belonging to different labels in the original sentence by the output layer.

Notably, the edge relationship feature representation of the edge relationship is not updated throughout the GAT model, i.e., the edge relationship feature representations of the same edge relationship in the first GAT layer, the second GAT layer, and the third GAT layer are the same. For example, the edge relationship features of the vertex "upper" pointing to the vertex "me" are the same in the three GAT layers.

It can be understood that, compared with the existing word segmentation method based on rules, the word segmentation method based on the graph attention network does not need to provide a special dictionary during word segmentation, thereby reducing labor cost; furthermore, the embodiment of the application does not need to use the content of the special field, so that the method can be conveniently moved from the general field to other fields.

In addition, compared with the existing method for realizing word segmentation by deep learning, the method provided by the embodiment of the application has the advantages that the dimension of the vertex characteristic representation and the dimension of the side relation characteristic representation determined based on the statistical method are smaller, so that the complexity of a model is greatly reduced; meanwhile, on the basis of guaranteeing and even improving the performance of the model, the time required by model prediction is reduced.

Fig. 6 is a schematic structural diagram of a word segmentation device according to an embodiment of the present application. The embodiment of the application is suitable for the situation of how to word the sentence, in particular for the situation of word segmentation of the pair Wen Yugou. The device can be implemented by software and/or hardware, and can implement the word segmentation method according to any embodiment of the application. As shown in fig. 6, the word segmentation apparatus includes:

The directed graph construction module 601 is configured to construct a directed graph of an original sentence with words in the original sentence as vertices; the vertexes in the directed graph have self-acting edge relations with themselves and bi-directional adjacent edge relations with adjacent vertexes;

a feature representation determination module 602 for determining a vertex feature representation of a vertex in the directed graph and an edge relationship feature representation of an edge relationship;

the word segmentation module 603 is configured to segment the original sentence according to the vertex feature representation of the vertices in the directed graph and the edge relationship feature representation of the edge relationship.

Illustratively, the feature representation determination module 602 is specifically configured to:

determining vertex characteristic representation of a target vertex according to the occurrence frequency of labels of words associated with the target vertex in the directed graph in the sample statement set;

Determining edge relation characteristic representation of adjacent edge relation of adjacent vertexes pointing to the target vertexes according to the occurrence frequency of labels of words associated with the target vertexes in the sample sentence set under the condition that the adjacent vertexes of the target vertexes point to the target vertexes;

and determining the edge relation characteristic representation of the self-acting edge relation of the target vertex according to the preset vector.

Illustratively, the word segmentation module 603 includes:

the label determining unit is used for determining labels of words in the original sentence according to vertex characteristic representations of vertexes in the directed graph and edge relation characteristic representations of edge relations;

the word segmentation unit is used for segmenting the original sentence according to the label of the word in the original sentence.

The tag determination unit includes:

a feature representation updating subunit, configured to update the vertex feature representation of the target vertex according to the vertex feature representation of the target vertex, the vertex feature representations of the adjacent vertices of the target vertex, and the edge relationship feature representation of the edge relationship associated with the target vertex in the directed graph;

and the label determining subunit is used for determining labels of words in the original sentence according to the updated vertex characteristic representation of the target vertex.

Illustratively, the feature representation update subunit is specifically configured to:

Determining self-transmission characteristics of the target vertex according to vertex characteristic representation of the target vertex and edge relation characteristic representation of self-acting edge relation of the target vertex;

determining the transmission characteristics of the adjacent vertexes to the target vertexes according to the vertex characteristic representation of the adjacent vertexes and the side relation characteristic representation of the adjacent side relation of the adjacent vertexes pointing to the target vertexes;

and updating the vertex characteristic representation of the target vertex according to the transmission characteristic and the attention coefficient related to the self-transmission characteristic.

Illustratively, the apparatus further comprises:

and the attention coefficient determining module is used for determining attention coefficients associated with the transmission characteristics and attention coefficients associated with the self-transmission characteristics according to the vertex characteristic representation of the target vertex and the vertex characteristic representation of the adjacent vertex.

Illustratively, the apparatus further comprises:

the preprocessing module is used for carrying out standardization processing on the original sentences.

According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.

Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 7, the electronic device 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

Various components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunication networks.

The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the respective methods and processes described above, for example, the word segmentation method. For example, in some embodiments, the word segmentation method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the word segmentation method described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the word segmentation method in any other suitable way (e.g. by means of firmware).

Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), systems On Chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.

It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.

The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims

1. A word segmentation method, comprising:

according to the vertex characteristic representation of the vertex in the directed graph and the edge relation characteristic representation of the edge relation, the original sentence is segmented;

Wherein the determining the vertex feature representation of the vertices and the edge relationship feature representation of the edge relationship in the directed graph comprises:

determining vertex characteristic representation of the target vertex according to the occurrence frequency of the labels of the words associated with the target vertex in the directed graph in the sample sentence set;

determining edge relation characteristic representation of adjacent edge relation of the adjacent vertex pointing to the target vertex according to the label occurrence frequency of words associated with the target vertex in a sample sentence set under the condition that the adjacent vertex pointing to the target vertex;

and determining the edge relation characteristic representation of the self-acting edge relation of the target vertex according to a preset vector.

2. The method of claim 1, wherein the segmenting the original sentence according to the vertex feature representation of the vertices and the edge relationship feature representation of the edge relationships in the directed graph comprises:

determining labels of words in the original sentence according to vertex characteristic representations of vertexes in the directed graph and edge relation characteristic representations of edge relations;

and according to the labels of the words in the original sentences, segmenting the words in the original sentences.

3. The method of claim 2, wherein the determining the label of the word in the original sentence from the vertex feature representation of the vertices in the directed graph and the edge relationship feature representation of the edge relationship comprises:

Updating the vertex characteristic representation of the target vertex according to the vertex characteristic representation of the target vertex, the vertex characteristic representation of the adjacent vertex of the target vertex and the edge relationship characteristic representation of the edge relationship associated with the target vertex in the directed graph;

and determining the label of the word in the original sentence according to the updated vertex characteristic representation of the target vertex.

4. A method according to claim 3, wherein said updating said vertex feature representation of a target vertex in said directed graph based on a vertex feature representation of a target vertex, a vertex feature representation of an adjacent vertex to said target vertex, and an edge relationship feature representation of an edge relationship associated with said target vertex comprises:

determining self-transmission characteristics of the target vertex according to the vertex characteristic representation of the target vertex and the edge relationship characteristic representation of the self-acting edge relationship of the target vertex;

determining the transmission characteristics of the adjacent vertexes to the target vertexes according to the vertex characteristic representation of the adjacent vertexes and the edge relation characteristic representation of the adjacent edge relation of the adjacent vertexes pointing to the target vertexes;

5. The method of claim 4, further comprising:

and determining the attention coefficient associated with the transmission characteristic and the attention coefficient associated with the self-transmission characteristic according to the vertex characteristic representation of the target vertex and the vertex characteristic representation of the adjacent vertex.

6. The method of claim 1, wherein before constructing the directed graph of the original sentence with the words in the original sentence as vertices, further comprising:

and normalizing the original sentence.

7. A word segmentation apparatus comprising:

the word segmentation module is used for segmenting the original sentence according to the vertex characteristic representation of the vertex in the directed graph and the edge relation characteristic representation of the edge relation;

wherein, the characteristic representation determining module is specifically configured to:

8. The apparatus of claim 7, wherein the word segmentation module comprises:

and the word segmentation unit is used for segmenting the original sentence according to the label of the word in the original sentence.

9. The apparatus of claim 8, wherein the tag determination unit comprises:

a feature representation updating subunit, configured to update a vertex feature representation of a target vertex in the directed graph according to a vertex feature representation of the target vertex, a vertex feature representation of a neighboring vertex of the target vertex, and an edge relationship feature representation of an edge relationship associated with the target vertex;

And the label determining subunit is used for determining the label of the word in the original sentence according to the updated vertex characteristic representation of the target vertex.

10. The apparatus of claim 9, wherein the feature representation updating subunit is specifically configured to:

11. The apparatus of claim 10, further comprising:

and the attention coefficient determining module is used for determining the attention coefficient associated with the transmission characteristic and the attention coefficient associated with the self-transmission characteristic according to the vertex characteristic representation of the target vertex and the vertex characteristic representation of the adjacent vertex.

12. The apparatus of claim 7, further comprising:

and the preprocessing module is used for carrying out standardization processing on the original sentence.

13. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the word segmentation method of any one of claims 1-6.

14. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the word segmentation method according to any one of claims 1-6.