CN111460834A

CN111460834A - Legal provision semantic annotation method and device based on LSTM network

Info

Publication number: CN111460834A
Application number: CN202010273691.8A
Authority: CN
Inventors: 莫同; 李雨萌; 骆旭辉; 刘亚亭; 张艺璇
Original assignee: Beijing Peking University Software Engineering Co ltd
Current assignee: Beijing Peking University Software Engineering Co ltd
Priority date: 2020-04-09
Filing date: 2020-04-09
Publication date: 2020-07-28
Anticipated expiration: 2040-04-09
Also published as: CN111460834B

Abstract

The invention relates to a method and a device for semantic annotation of legal provisions based on an LSTM network. The method comprises: obtaining a text to be analyzed; analyzing the text to obtain all of its words and the part-of-speech tag corresponding to each word; converting the words into D-dimensional word vectors; inputting the D-dimensional word vectors into a fully-connected neural network to obtain feature codes; comparing the part-of-speech tags of the text to be analyzed with the part-of-speech tags of the texts in a preset database to obtain a best matching text; obtaining a final vector representation; and inputting the final vector representation into the fully-connected neural network to output the semantic role label of each word in the text to be analyzed.

Description

Legal provision semantic annotation method and device based on LSTM network

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a legal provision semantic annotation method and device based on an LSTM network.

Background

Most existing methods for shallow semantic analysis, such as semantic role labeling, need to be combined with some degree of syntactic analysis or with manually extracted features. Syntactic analysis has a certain error rate, and these errors carry over into the subsequent semantic analysis result. Semantic role labeling therefore still poses a number of technical problems in natural language processing. In recent years, with the rapid development of deep learning, semantic role labeling for English and Chinese has improved greatly and achieves good results on data sets in multiple language fields.

However, as the number of cases and laws in the judicial field grows, people engaged in legal work come under great pressure. Even professional lawyers find it difficult to be familiar with every legal provision, and extracting case-related content from a large volume of legal texts takes a great deal of time and energy, so working efficiency is low. Assisting practitioners with artificial intelligence has therefore become an urgent problem to be solved.

Disclosure of Invention

In view of the above, the present invention provides a method and a device for semantic annotation of legal provisions based on an LSTM network, to solve the problems in the prior art that obtaining case-related content from a large volume of legal texts requires a great deal of time and effort and that working efficiency is low.

To achieve this purpose, the invention adopts the following technical scheme: a legal provision semantic annotation method based on an LSTM network, comprising the following steps:

acquiring a text and preprocessing the text to acquire the text to be analyzed;

analyzing the text to be analyzed to obtain all words of the text to be analyzed and part-of-speech labels corresponding to the words, converting all the words into D-dimensional word vectors by adopting a word vector model, and inputting all the D-dimensional word vectors into a fully-connected neural network to obtain the feature codes of all the words;

comparing part-of-speech labels of the text to be analyzed with part-of-speech labels of texts in a preset database to obtain a best matching text in the preset database, and vectorizing semantic role labels of the best matching text and position information corresponding to the semantic role labels to obtain a feature vector;

compounding the feature codes and the feature vectors to obtain final vector representation;

and inputting the final vector representation into a fully-connected neural network, and outputting the semantic role labels of each word in the text to be analyzed.

Further, the acquiring a text and preprocessing the text to acquire a text to be analyzed includes:

standardizing the text to obtain a text to be analyzed in a standard data input form, wherein the text to be analyzed in the standard data input form is a text with a specified central predicate.

Further, the center predicate includes:

administrative subject, administrative counterpart, time, place.

Further, the analyzing the text to be analyzed to obtain all words of the text to be analyzed and part-of-speech tags corresponding to the words includes:

splitting the text to be analyzed according to a legal dictionary by adopting a Chinese word segmentation tool and a part of speech tagging tool;

and acquiring all words of the analysis text and part-of-speech labels corresponding to the words.

Further, inputting all the D-dimensional word vectors into a fully-connected neural network to obtain feature codes of all the words, including:

sequentially inputting all the D-dimensional word vectors into a fully-connected neural network, wherein the fully-connected neural network is provided with a feature encoder, and the feature encoder comprises a 4-layer stacked bidirectional LSTM, including a first layer LSTM, a second layer LSTM, a third layer LSTM and a fourth layer LSTM;

the first layer LSTM encodes the D-dimensional word vectors as its input, the input of each subsequent layer LSTM is the output of the previous layer, and the fourth layer LSTM outputs the feature codes.

Further, the comparing the part-of-speech tag of the text to be analyzed with the part-of-speech tag of the text in a preset database to obtain the best matching text in the preset database includes:

matching character strings to two sides by taking the central predicate as the center of part-of-speech labels of the text to be analyzed and part-of-speech labels of the text in a preset database;

and calculating the matching degree according to the matching length of the character string to obtain the best matching text.

Further, vectorizing the semantic role labels of the best matching text and the position information corresponding to the semantic role labels to obtain a feature vector includes:

vectorizing the semantic role annotation of the best matching text to obtain a first vector representation;

vectorizing the distance between the semantic role label and the central predicate to obtain a second vector representation;

the first vector representation and the second vector representation are combined into a feature vector.

Further, the inputting the final vector representation into a fully-connected neural network and outputting a semantic role label of each word in the text to be analyzed includes:

inputting the final vector into a fully-connected neural network, wherein a softmax layer is arranged in the fully-connected neural network, semantic role labeling is carried out on each word by adopting a softmax classifier on the softmax layer, and the softmax layer outputs the semantic role labeling.

Further, the word vector model includes:

word2vec language model, glove language model, or BERT language model.

An embodiment of the application provides a legal provision semantic annotation device based on an LSTM network, including:

the preprocessing module is used for acquiring a text and preprocessing the text to acquire the text to be analyzed;

the first processing module is used for analyzing and processing the text to be analyzed so as to obtain all words of the text to be analyzed and part-of-speech labels corresponding to the words, converting all the words into D-dimensional word vectors by adopting a word vector model, and inputting all the D-dimensional word vectors into a fully-connected neural network to obtain feature codes of all the words;

the second processing module is used for comparing part-of-speech labels of the text to be analyzed with part-of-speech labels of texts in a preset database to obtain a best matching text in the preset database, and vectorizing semantic role labels of the best matching text and position information corresponding to the semantic role labels to obtain a feature vector;

an obtaining module, configured to compound the feature code and the feature vector to obtain a final vector representation;

and the output module is used for inputting the final vector representation into a fully-connected neural network and outputting the semantic role labels of each word in the text to be analyzed.

By adopting the technical scheme, the invention can achieve the following beneficial effects:

First, the legal provision text is vectorized and its part-of-speech tagging result is predicted. Second, the most similar legal provision in the database is found based on the part-of-speech tagging result, and a vector representation of that provision's semantic role labels is obtained. Finally, the data are input into the LSTM network to obtain the semantic role label of each word.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a schematic diagram illustrating the steps of the legal provision semantic annotation method based on an LSTM network according to the present invention;

FIG. 2 is a schematic structural diagram of the legal provision semantic annotation device based on an LSTM network.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.

A specific legal provision semantic annotation method based on an LSTM network provided in the embodiment of the present application is described below with reference to the accompanying drawings.

As shown in FIG. 1, the legal provision semantic annotation method based on an LSTM network provided in this embodiment of the present application includes:

S101, acquiring a text and preprocessing the text to acquire the text to be analyzed;

The method is mainly intended for staff who look up legal provisions. First, the legal provision text is obtained and preprocessed. Preprocessing standardizes the text into a standard data input form, that is, the central predicate of each input text is specified; the text with the specified central predicate is the text to be analyzed.
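As an illustration of what a "standard data input form" could look like, the following minimal sketch pairs a cleaned text with its designated central predicate. The `AnalysisText` container, the `standardize` helper, the whitespace-stripping rule and the sample sentence are all assumptions made for this example; the source only states that the central predicate of each input text is specified.

```python
from dataclasses import dataclass


@dataclass
class AnalysisText:
    """Hypothetical container for a text to be analyzed."""
    text: str          # cleaned legal provision text
    predicate: str     # designated central predicate
    span: tuple        # (start, end) character offsets of the predicate


def standardize(raw_text, predicate):
    """Strip whitespace and record where the central predicate occurs."""
    text = "".join(raw_text.split())      # assumed normalization rule
    start = text.find(predicate)
    if start < 0:
        raise ValueError("central predicate not found in text")
    return AnalysisText(text, predicate, (start, start + len(predicate)))


item = standardize(" 行政机关 应当受理 申请 ", "受理")
print(item.text, item.span)
```

Recording the predicate as a character span keeps this step independent of the later word segmentation in S102.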

S102, analyzing the text to be analyzed to obtain all words of the text to be analyzed and part-of-speech labels corresponding to the words, converting all the words into D-dimensional word vectors by adopting a word vector model, and inputting all the D-dimensional word vectors into a fully-connected neural network to obtain feature codes of all the words;

The obtained text to be analyzed is split into a plurality of words, and the part-of-speech tag corresponding to each word is produced at the same time. The words are then vectorized with a word vector model, which converts each word into a D-dimensional word vector, and the D-dimensional word vectors are input into a fully-connected neural network for training to obtain the feature code of each word. A D-dimensional word vector represents a Chinese word using a vector of dimension D.

The word vector model may be any conventional word vector model; no special requirement is imposed on it.

S103, comparing part-of-speech labels of the text to be analyzed with part-of-speech labels of the text in a preset database to obtain a best matching text in the preset database, and vectorizing semantic role labels of the best matching text and position information corresponding to the semantic role labels to obtain a feature vector;

A database is preset, legal provision texts are stored in it, and the legal provisions in the database are tagged with parts of speech. The part-of-speech tags of the given text to be analyzed are compared with the part-of-speech tags of the legal provision texts in the database, and the legal provision text with the highest matching degree is taken as the best matching text. The semantic role labels in the best matching text and the position information corresponding to those labels are then vectorized to obtain a feature vector.

S104, compounding the feature codes and the feature vectors to obtain final vector representation;

The feature code of each word obtained in step S102 is concatenated with the feature vector of the corresponding word in the best matching text obtained in step S103 to obtain the final vector representation.

And S105, inputting the final vector representation into a fully-connected neural network, and outputting the semantic role label of each word in the text to be analyzed.

The final vector representation is input into the fully-connected neural network, and a softmax classifier in the fully-connected neural network identifies and outputs the semantic role label of each word in the text to be analyzed.

In summary, the legal provision semantic annotation method based on an LSTM network vectorizes the legal provision text and predicts its part-of-speech tagging result, finds the most similar legal provision text in the database based on the part-of-speech tagging result to obtain a vector representation of that provision's semantic role labels, and inputs the data into the LSTM network to obtain the semantic role label of each word.

In some embodiments, obtaining a text and preprocessing the text to obtain a text to be analyzed includes:

standardizing the text to obtain a text to be analyzed in a standard data input form, wherein the text to be analyzed in the standard data input form is a text with a specified central predicate.

Preferably, the central predicate includes:

administrative subject, administrative counterpart, time, place.

In some embodiments, analyzing the text to be analyzed to obtain all words of the text to be analyzed and part-of-speech tags corresponding to the words includes:

splitting a text to be analyzed according to a legal dictionary by adopting a Chinese word segmentation tool and a part-of-speech tagging tool;

and acquiring all words of the analyzed text and part-of-speech labels corresponding to the words.

Specifically, a Chinese word segmentation tool is used to segment the legal provision text into all of its words, and a part-of-speech tagging tool is used to tag each word obtained from the legal provision text, yielding the part-of-speech tag corresponding to each word. The text to be analyzed is split according to the legal dictionary so that the relevant terms in the legal dictionary are obtained as words, for example words filling semantic roles such as administrative subject, administrative counterpart, time and place.

It should be noted that the Chinese word segmentation tool and the part-of-speech tagging tool used in the present application both belong to the prior art and are not described again here.
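The dictionary-guided splitting described above can be sketched with a toy forward maximum-matching segmenter. This is only a stand-in for the existing segmentation and tagging tools the application relies on, which are not named in the source; the `LEGAL_LEXICON` entries, the tag letters and the fallback tag `"x"` are hypothetical.

```python
# Hypothetical legal dictionary: word -> part-of-speech tag.
LEGAL_LEXICON = {"行政机关": "n", "应当": "d", "受理": "v", "申请": "n"}


def segment_and_tag(text, lexicon):
    """Greedy forward maximum matching against a (non-empty) lexicon."""
    max_len = max(len(word) for word in lexicon)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if word in lexicon:
                result.append((word, lexicon[word]))
                i += size
                break
        else:
            # No dictionary entry starts here: emit one character as unknown.
            result.append((text[i], "x"))
            i += 1
    return result


print(segment_and_tag("行政机关应当受理申请", LEGAL_LEXICON))
```

Trying the longest dictionary entry first is what keeps a multi-character legal term such as 行政机关 from being broken into shorter pieces.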

In some embodiments, inputting all D-dimensional word vectors into a fully-connected neural network to obtain feature codes of all words includes:

sequentially inputting all D-dimensional word vectors into a fully-connected neural network, wherein the fully-connected neural network is provided with a feature encoder, and the feature encoder comprises a 4-layer stacked bidirectional LSTM, including a first layer LSTM, a second layer LSTM, a third layer LSTM and a fourth layer LSTM;

the first layer LSTM encodes the D-dimensional word vectors as its input, the input of each subsequent layer LSTM is the output of the previous layer, and the fourth layer LSTM outputs the feature codes.

Specifically, all D-dimensional word vectors are sequentially input into a feature encoder built from bidirectional LSTM structures. The feature encoder consists of 4 stacked bidirectional LSTM layers: a first layer LSTM, a second layer LSTM, a third layer LSTM and a fourth layer LSTM. The first layer LSTM encodes the D-dimensional vectors as its input, the second layer LSTM takes the output of the first layer LSTM as its input, and in general the input of each layer LSTM is the output of the previous layer. Finally, the fourth layer LSTM outputs the feature code W_i. To mitigate the vanishing gradient problem that occurs with multilayer LSTM structures, a highway LSTM structure is introduced in this application.
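The stacked encoder can be sketched in NumPy as follows. This is an untrained, randomly initialized forward pass intended only to show the data flow through 4 bidirectional layers. The hidden size, the initialization and in particular the form of the highway connection (a sigmoid gate mixing each layer's output with its input, applied from the second layer on) are assumptions; the source does not specify how its highway LSTM is built.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_pass(xs, W, U, b, hidden):
    """Run one LSTM direction over xs of shape (T, in_dim); return (T, hidden)."""
    h, c = np.zeros(hidden), np.zeros(hidden)
    outputs = []
    for x in xs:
        i, f, o, g = np.split(W @ x + U @ h + b, 4)   # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)


def init_direction(rng, in_dim, hidden):
    """Random parameters for one direction (illustrative initialization)."""
    return (rng.normal(0, 0.1, (4 * hidden, in_dim)),
            rng.normal(0, 0.1, (4 * hidden, hidden)),
            np.zeros(4 * hidden))


def encode(word_vectors, hidden=8, num_layers=4, seed=0):
    """Map (T, D) word vectors to (T, 2 * hidden) feature codes W_i."""
    rng = np.random.default_rng(seed)
    xs = np.asarray(word_vectors, dtype=float)
    for layer in range(num_layers):
        in_dim = xs.shape[1]
        fwd = init_direction(rng, in_dim, hidden)
        bwd = init_direction(rng, in_dim, hidden)
        h_fwd = lstm_pass(xs, *fwd, hidden)
        h_bwd = lstm_pass(xs[::-1], *bwd, hidden)[::-1]   # right-to-left pass
        out = np.concatenate([h_fwd, h_bwd], axis=1)
        if layer > 0:
            # Assumed highway gate: blend layer output with layer input.
            gate = sigmoid(xs @ rng.normal(0, 0.1, (in_dim, 2 * hidden)))
            out = gate * out + (1.0 - gate) * xs
        xs = out
    return xs


features = encode(np.ones((5, 16)), hidden=8)
print(features.shape)
```

The gated skip path gives gradients a route around each recurrent layer, which is the usual motivation for highway connections in deep stacks.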

In some embodiments, comparing the part-of-speech tag of the text to be analyzed with the part-of-speech tag of the text in the preset database to obtain the best matching text in the preset database includes:

Preferably, vectorizing the semantic role labels of the best matching text and the position information corresponding to the semantic role labels to obtain a feature vector includes:

vectorizing the semantic role label of the best matching text to obtain a first vector representation;

Specifically, a longest string matching method is used: taking the central predicate V as the center, strings are matched outward to both sides, the matched length for the i-th text in the database is L_i, and the best matching text S_sim is obtained as

S_sim = argmax_i (L_i)
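A minimal sketch of this matching step is given below. It assumes that each text is represented by its sequence of part-of-speech tags plus the index of its central predicate, and that L_i is the number of consecutive tags that agree when walking left and right from the predicate. The function names, the record layout and the sample tag sequences are hypothetical.

```python
def match_length(tags_a, pred_a, tags_b, pred_b):
    """Count tags that agree when expanding outward from both predicates."""
    length = 0
    # Walk left from the predicates.
    i, j = pred_a - 1, pred_b - 1
    while i >= 0 and j >= 0 and tags_a[i] == tags_b[j]:
        length += 1
        i -= 1
        j -= 1
    # Walk right from the predicates.
    i, j = pred_a + 1, pred_b + 1
    while i < len(tags_a) and j < len(tags_b) and tags_a[i] == tags_b[j]:
        length += 1
        i += 1
        j += 1
    return length


def best_match(query_tags, query_pred, database):
    """Return the database entry with the largest matched length (the argmax)."""
    return max(
        database,
        key=lambda entry: match_length(query_tags, query_pred,
                                       entry["pos"], entry["pred"]),
    )


database = [
    {"id": "A", "pos": ["n", "v", "n"], "pred": 1},
    {"id": "B", "pos": ["r", "n", "d", "v", "n", "u"], "pred": 3},
]
print(best_match(["n", "d", "v", "n"], 2, database)["id"])
```

Python's `max` returns the first entry on ties, so a real system would need an explicit tie-breaking rule, which the source does not describe.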

The semantic role labeling result of the best matching text S_sim is then vectorized to obtain the vector representation of the best matching text. Specifically, the semantic role labels in the best matching text are vectorized to obtain a dim1-dimensional vector representation R_sim. At the same time, the relative distance between each semantic role and the central predicate is encoded to obtain a dim2-dimensional vector representation PE_sim. The vectors R_sim and PE_sim are concatenated with the feature codes obtained for the text to be analyzed to yield the final vector representation.
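One simple way to realize the two encodings is sketched below, assuming one-hot vectors for both the role labels (so dim1 equals the number of roles) and the clipped relative distance to the predicate (so dim2 equals 2 * MAX_DIST + 1). The role inventory `ROLES`, the clipping bound `MAX_DIST` and the one-hot choice are assumptions for illustration; the source does not state which encoding it uses, nor how words of the matched text are aligned with words of the text to be analyzed.

```python
# Hypothetical role inventory; "O" marks words without a role.
ROLES = ["SUBJ", "CPTY", "TIME", "LOC", "PRED", "O"]
MAX_DIST = 3   # assumed clipping bound for relative distance


def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def role_vectors(role_labels):
    """R_sim: one vector of length len(ROLES) per word of the matched text."""
    return [one_hot(ROLES.index(label), len(ROLES)) for label in role_labels]


def position_vectors(num_words, pred_index):
    """PE_sim: one-hot of each word's clipped offset from the central predicate."""
    vectors = []
    for k in range(num_words):
        offset = max(-MAX_DIST, min(MAX_DIST, k - pred_index))
        vectors.append(one_hot(offset + MAX_DIST, 2 * MAX_DIST + 1))
    return vectors


labels = ["SUBJ", "O", "PRED", "CPTY"]
r_sim = role_vectors(labels)
pe_sim = position_vectors(len(labels), pred_index=2)
print(r_sim[0], pe_sim[0])
```

Learned embeddings could replace the one-hot vectors without changing the surrounding steps, since only the dimensions dim1 and dim2 matter to the concatenation.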

In some embodiments, inputting the final vector representation into a fully-connected neural network and outputting a semantic role label for each word in the text to be analyzed includes:

and inputting the final vector into a fully-connected neural network, wherein a softmax layer is arranged in the fully-connected neural network, semantic role labeling is carried out on each word by adopting a softmax classifier in the softmax layer, and the softmax layer outputs the semantic role labeling.

Specifically, the output W_i of the last bidirectional LSTM layer is concatenated with the vectors R_sim and PE_sim obtained in step S103 to obtain the final vector representation [W_i; R_sim; PE_sim]. After the final vector representation [W_i; R_sim; PE_sim] is input into the fully-connected neural network, a multi-class classification result is obtained through a softmax layer. The output of the softmax layer is the semantic role label of each word in the text to be analyzed relative to the given predicate.
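The concatenation and classification for a single word can be sketched as follows, with one linear layer standing in for the fully-connected network. The weights, the two-label inventory and the tiny vector sizes are made up for the example and do not come from the source.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_word(w_i, r_sim, pe_sim, weights, bias, labels):
    """Concatenate [W_i; R_sim; PE_sim], apply a linear layer, then softmax."""
    x = list(w_i) + list(r_sim) + list(pe_sim)
    scores = [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


labels = ["SUBJ", "O"]                    # hypothetical label set
weights = [[1.0, 0.0, 0.0, 1.0, 0.0],     # illustrative, untrained weights
           [0.0, 1.0, 1.0, 0.0, 0.0]]
bias = [0.0, 0.0]
label, probs = classify_word([1.0, 0.0], [0.0, 1.0], [1.0], weights, bias, labels)
print(label, probs)
```

In the full method this step runs once per word, so the output is a sequence of role labels aligned with the segmented text.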

Preferably, the word vector model provided in the present application includes:

word2vec language model, glove language model, or BERT language model.

As shown in FIG. 2, the present application provides a legal provision semantic annotation device based on an LSTM network, including:

the preprocessing module 201 is configured to acquire a text and preprocess the text to acquire a text to be analyzed;

the first processing module 202 is configured to analyze a text to be analyzed to obtain all words of the text to be analyzed and part-of-speech tags corresponding to the words, convert all the words into D-dimensional word vectors by using a word vector model, and input all the D-dimensional word vectors into a fully-connected neural network to obtain feature codes of all the words;

the second processing module 203 is configured to compare part-of-speech tags of the text to be analyzed with part-of-speech tags of the text in the preset database to obtain a best-matching text in the preset database, and vectorize semantic role tags of the best-matching text and position information corresponding to the semantic role tags to obtain feature vectors;

an obtaining module 204, configured to compound the feature codes and the feature vectors to obtain a final vector representation;

and the output module 205 is configured to input the final vector representation into a fully-connected neural network, and output a semantic role label of each word in the text to be analyzed.

The LSTM network-based legal provision semantic annotation device works as follows. The preprocessing module 201 obtains a text and preprocesses it to obtain the text to be analyzed. The first processing module 202 analyzes the text to be analyzed to obtain all of its words and the part-of-speech tag corresponding to each word, converts all the words into D-dimensional word vectors using a word vector model, and inputs all the D-dimensional word vectors into a fully-connected neural network to obtain the feature codes of all the words. The second processing module 203 compares the part-of-speech tags of the text to be analyzed with the part-of-speech tags of the texts in a preset database to obtain the best matching text in the preset database, and vectorizes the semantic role labels of the best matching text and the position information corresponding to those labels to obtain the feature vectors. The obtaining module 204 compounds the feature codes and the feature vectors to obtain a final vector representation. The output module 205 inputs the final vector representation into the fully-connected neural network and outputs the semantic role label of each word in the text to be analyzed.

In summary, the invention provides a method and a device for semantic annotation of legal provisions based on an LSTM network. The method comprises: obtaining a text and preprocessing it to obtain the text to be analyzed; analyzing the text to be analyzed to obtain all of its words and the part-of-speech tag corresponding to each word; converting all the words into D-dimensional word vectors using a word vector model; inputting all the D-dimensional word vectors into a fully-connected neural network to obtain the feature codes of all the words; comparing the part-of-speech tags of the text to be analyzed with the part-of-speech tags of the texts in a preset database to obtain the best matching text in the preset database; vectorizing the semantic role labels of the best matching text and the position information corresponding to those labels to obtain feature vectors; compounding the feature codes and the feature vectors to obtain a final vector representation; and inputting the final vector representation into the fully-connected neural network to output the semantic role label of each word in the text to be analyzed. Elements of a legal provision such as the performer, the receiver, the time and the place can thus be analyzed automatically, which improves the working efficiency of the relevant staff.

It is to be understood that the apparatus embodiments provided above correspond to the method embodiments described above, and corresponding specific contents may be referred to each other, which are not described herein again.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A legal provision semantic annotation method based on an LSTM network, characterized by comprising the following steps:

acquiring a text and preprocessing the text to acquire the text to be analyzed;

2. The method of claim 1, wherein the obtaining and pre-processing the text to obtain the text to be analyzed comprises:

3. The method of claim 2, wherein the center predicate comprises:

administrative subject, administrative counterpart, time, place.

4. The method according to claim 1, wherein the analyzing the text to be analyzed to obtain all words of the text to be analyzed and part-of-speech tags corresponding to the words comprises:

5. The method of claim 1, wherein inputting all the D-dimensional vectors into a fully-connected neural network to obtain feature codes of all the words comprises:

sequentially inputting all the D-dimensional vectors into a fully-connected neural network, wherein the fully-connected neural network is provided with a feature encoder, and the feature encoder comprises a 4-layer stacked bidirectional LSTM, including a first layer LSTM, a second layer LSTM, a third layer LSTM and a fourth layer LSTM;

the first layer LSTM encodes the D-dimensional vectors as its input, the input of each subsequent layer LSTM is the output of the previous layer, and the fourth layer LSTM outputs the feature codes.

6. The method of claim 2, wherein the comparing the part-of-speech tag of the text to be analyzed with the part-of-speech tag of the text in a preset database to obtain the best matching text in the preset database comprises:

7. The method of claim 6, wherein vectorizing the semantic role labels of the best matching text and the position information corresponding to the semantic role labels to obtain a feature vector comprises:

8. The method of claim 1, wherein inputting the final vector representation into a fully-connected neural network and outputting a semantic role label for each word in the text to be analyzed comprises:

9. The method of any of claims 1 to 8, wherein the word vector model comprises:

word2vec language model, glove language model, or BERT language model.

10. A legal provision semantic annotation device based on an LSTM network, characterized by comprising: