CN111859933A

CN111859933A - Training method, recognition method, device and equipment of Malay recognition model

Info

Publication number: CN111859933A
Application number: CN202010393898.9A
Authority: CN
Inventors: 付颖雯; 林楠铠; 蒋盛益
Original assignee: Guangdong University of Foreign Studies
Current assignee: Guangdong University of Foreign Studies
Priority date: 2020-05-11
Filing date: 2020-05-11
Publication date: 2020-10-30
Anticipated expiration: 2040-05-11
Also published as: CN111859933B

Abstract

The invention discloses a training method of a Malay recognition model, which comprises the following steps: acquiring at least one Malay sentence to be trained from a preset Malay database; wherein, the same word with different parts of speech is applied to the Malay sentence to be trained; acquiring semantic information corresponding to the to-be-trained maleic sentence according to a preset maleic information extraction model; the semantic information comprises word vector characteristic information and text characteristic information; coding the semantic information according to a preset Bi-LSTM model; and respectively inputting the coded semantic information into a part of speech tagging task module and a named entity recognition task module for training to obtain a trained maleic language recognition model. The embodiment of the invention also provides a method, a device, equipment and a medium for identifying the Malay, which effectively solve the problem of poor accuracy of Malay identification in the prior art.

Description

Training method, recognition method, device and equipment of Malay recognition model

Technical Field

The invention relates to the technical field of maleic language recognition, in particular to a training method, a recognition method, a device and equipment of a maleic language recognition model.

Background

Malaysia can currently be recognized in three ways. Firstly, identification is carried out based on domain rules; secondly, recognition is carried out based on statistical machine learning, and thirdly, recognition is carried out based on combination of rules and statistics.

However, the identification method based on the rule needs to construct a complex rule to implement, needs to spend a lot of manpower and material resources, and usually needs to complete the rule through multiple evaluations and repeated corrections. The prepared rule can only have good effect in the corresponding field, and the requirement can not be met if the field is changed; the statistical model based on statistical machine learning lacks generalization capability in the aspect of feature learning, and models in different fields cannot be used universally; although the recognition method based on the combination of the rules and the statistics can make up the defects of the two methods, the same word with different parts of speech still cannot be recognized well, for example, in two sentences, namely 'i eat an apple' and 'i have an apple mobile phone', the two apples point to completely different things, but still share the same word vector, and the accuracy of the current Malay recognition is poor.

Disclosure of Invention

The embodiment of the invention provides a training method of a Malay recognition model, a Malay recognition method, a Malay recognition device, equipment and a medium, which can effectively solve the problem of poor accuracy of Malay recognition in the prior art.

An embodiment of the present invention provides a training method for a maleic language recognition model, including:

acquiring at least one Malay sentence to be trained from a preset Malay database; wherein, the same word with different parts of speech is applied to the Malay sentence to be trained;

acquiring semantic information corresponding to the to-be-trained maleic sentence according to a preset maleic information extraction model; the semantic information comprises word vector characteristic information and text characteristic information;

coding the semantic information according to a preset Bi-LSTM model;

and respectively inputting the coded semantic information into a part of speech tagging task module and a named entity recognition task module for training to obtain a trained maleic language recognition model.

As an improvement of the above scheme, the obtaining of semantic information corresponding to the to-be-trained maleic sentence according to a preset maleic language information extraction model specifically includes:

Acquiring word vector characteristic information corresponding to the to-be-trained Malay sentence according to a word vector extraction model;

and extracting text characteristic information corresponding to the to-be-trained Malay sentence according to the convolutional neural network.

As an improvement of the above scheme, encoding the semantic information according to a preset Bi-LSTM model specifically includes:

the forward LSTM layer encodes the word vector characteristic information, and the backward LSTM layer encodes the text characteristic information;

and splicing the coded word vector characteristic information and the coded text characteristic information to obtain coded semantic information.

As an improvement of the above scheme, the respectively inputting the encoded semantic information to the part-of-speech tagging task module and the named entity recognition task module for training to obtain a trained malay recognition model specifically includes:

respectively inputting the coded semantic information to a Multi-header self-orientation layer of the part-of-speech tagging task module, and then inputting the semantic information to a corresponding loss function for calculation to obtain a first loss value;

respectively inputting the coded semantic information to a Multi-header-entry layer of the named entity recognition task module, and then inputting the semantic information to a corresponding loss function for calculation to obtain a second loss value;

And training a Malay recognition model according to the loss value and the preset training times to obtain the Malay recognition model after training.

The invention correspondingly provides a method for identifying Malay in another embodiment, which comprises the following steps:

acquiring a trained Malay recognition model; the trained maleic language recognition model is trained through maleic language sentences containing the same word with different parts of speech;

and identifying the to-be-processed Malay sentences according to the trained Malay identification model to obtain the named entities corresponding to the to-be-processed Malay sentences.

Another embodiment of the present invention correspondingly provides a training device for a maleic language recognition model, comprising:

the first acquisition module is used for acquiring at least one maleic sentence to be trained from a preset maleic language database; wherein, the same word with different parts of speech is applied to the Malay sentence to be trained;

the information extraction module is used for acquiring semantic information corresponding to the to-be-trained maleic sentence according to a preset maleic information extraction model; the semantic information comprises word vector characteristic information and text characteristic information;

The encoding module is used for encoding the semantic information according to a preset Bi-LSTM model;

and the training module is used for inputting the coded semantic information into the part-of-speech tagging task module and the named entity recognition task module respectively for training to obtain a trained maleic language recognition model.

Another embodiment of the present invention provides a device for identifying malay, including:

the second acquisition module is used for acquiring the trained maleic language recognition model; the trained maleic language recognition model is trained through maleic language sentences containing the same word with different parts of speech;

and the recognition module is used for recognizing the to-be-processed maleic sentence according to the trained maleic recognition model to obtain the named entity corresponding to the to-be-processed maleic sentence.

Another embodiment of the present invention provides a training apparatus for a maleic language recognition model, which includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, and when the processor executes the computer program, the processor implements the training method for the maleic language recognition model according to the above embodiment of the present invention.

Another embodiment of the present invention provides a storage medium, where the computer-readable storage medium includes a stored computer program, where when the computer program runs, a device where the computer-readable storage medium is located is controlled to execute the training method for a maleic language identification model according to the above embodiment of the present invention.

The embodiment of the invention has the following beneficial effects:

firstly, obtaining a maleic sentence containing the same word with different parts of speech, and then obtaining semantic information corresponding to the maleic sentence according to a preset maleic information extraction model; the feature information comprises word vector feature information and text feature information, the semantic information is coded according to a preset Bi-LSTM model, the coded semantic information is input to a part-of-speech tagging task module and a named entity recognition task module respectively for training, a trained maleic language recognition model is obtained, and the trained maleic language recognition model can recognize maleic sentences to obtain corresponding named entities, so that the same word can be recognized well in different fields, and the accuracy of maleic language recognition is improved.

Drawings

FIG. 1 is a schematic flow chart of a training method for a Malay recognition model according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a convolutional neural network computation provided in an embodiment of the present invention;

FIG. 3 is a schematic diagram of a Bi-LSTM model provided in accordance with an embodiment of the present invention;

FIG. 4 is a flow chart illustrating a Malay recognition method according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a training apparatus for a Malay recognition model according to an embodiment of the present invention;

fig. 6 is a schematic structural diagram of a recognition apparatus for malaysian according to an embodiment of the present invention;

fig. 7 is a schematic structural diagram of a training apparatus for a maleic language identification model according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a schematic flow chart of training of a maleic language recognition model according to an embodiment of the present invention.

s10, acquiring at least one Malay sentence to be trained from a preset Malay database; wherein the same word with different parts of speech is applied to the Malay sentence to be trained.

It should be noted that the original Malay language material is obtained by crawling a large amount of Malay news, which covers political, financial, social, military and other subjects. Wherein, news website includes: http:// www.bharian.com.my/, http:// www.utusan.com.my/, http:// www.theborneopost.com/, http:// www.malaysiakini.com/bm, http:// www.hmetro.com.my/. And collecting a large number of named entity instances from DBpedia and the obtained original Malaysia, and constructing a preset Malaysia database. Training and testing a Malay recognition model by using the language library; and manually checking sentences with difference between the test result and the original material library. In order to ensure the quality of the examination, each sentence to be examined is examined by a plurality of examiners, the examination result is put into the Malay recognition model for training and mutual iteration, and finally the quality of the Malay database is improved at the same time.

S20, acquiring semantic information corresponding to the to-be-trained maleic sentence according to a preset maleic information extraction model; the semantic information comprises word vector characteristic information and text characteristic information.

Specifically, since the Malay belongs to the adhesion type language, the character-level features can reflect the morphological features of the Malay vocabulary. Since the same word has different meanings in different fields, word vector feature information and text feature information need to be extracted, so that the word can be better identified.

And S30, encoding the semantic information according to a preset Bi-LSTM model.

It should be noted that, the Bi-LSTM model can obtain information from multiple directions at the same time, so that the Bi-LSTM model is used for encoding, so that the result is more accurate.

And S40, respectively inputting the coded semantic information to the part-of-speech tagging task module and the named entity recognition task module for training, and acquiring a trained maleic language recognition model.

Specifically, as the two tasks are alternately trained in each iteration, the training speed of the model is increased, and the performance of the named entity recognition task is also improved.

In summary, firstly, the maleic sentences containing the same word with different parts of speech are obtained, and then the semantic information corresponding to the maleic sentences is obtained according to the preset maleic information extraction model; the feature information comprises word vector feature information and text feature information, the semantic information is coded according to a preset Bi-LSTM model, the coded semantic information is input to a part-of-speech tagging task module and a named entity recognition task module respectively for training, a trained maleic language recognition model is obtained, and the trained maleic language recognition model can recognize maleic sentences to obtain corresponding named entities, so that the same word can be recognized well in different fields, and the accuracy of maleic language recognition is improved.

In the foregoing embodiment, preferably, the obtaining semantic information corresponding to the to-be-trained maleic sentence according to a preset maleic language information extraction model, and step S20 specifically includes:

s201, obtaining word vector characteristic information corresponding to the to-be-trained Malay sentence according to the word vector extraction model.

In the embodiment, static word vectors are extracted through GloVe, dynamic word vectors are extracted through BERT, and the static word vectors and the dynamic word vectors are combined to form word vector feature information.

S202, extracting text characteristic information corresponding to the to-be-trained Malay sentence according to the convolutional neural network.

In this embodiment, referring to fig. 2, the GLU model-based convolution structure includes two parts: a simple one-dimensional convolution and a one-dimensional convolution activated using a sigmoid function. Meanwhile, a residual error structure is adopted to enable character information to be transmitted in multiple channels:

x is the character embedding of a word,

is a dot product operation. "Gate" structure σ (Conv 1D)₂(X)) corresponds to Conv 1D₁Each output of (X) is added to a "valve" to control flow. And the combination of the residual error and the gate convolution achieves the effect of multi-channel transmission.

In the foregoing embodiment, preferably, the encoding the semantic information according to a preset Bi-LSTM model, and step S30 specifically includes:

S301, the forward LSTM layer encodes the word vector characteristic information, and the backward LSTM layer encodes the text characteristic information.

And S302, splicing the coded word vector characteristic information and the coded text characteristic information to obtain coded semantic information.

Specifically, referring to fig. 3, if the word vector feature information is represented as h and the text feature information is represented as x, the Bi-LSTM model can be used to model sentences from the front and the back, so that not only the preceding information can be saved, but also the following information can be considered.

In the foregoing embodiment, preferably, the step S40 specifically includes inputting the encoded semantic information to the part-of-speech tagging task module and the named entity recognition task module respectively for training, and obtaining a trained maleic language recognition model, where:

s401, the encoded semantic information is respectively input to a Multi-header self-orientation layer of the part-of-speech tagging task module, and then input to a corresponding loss function for calculation, so as to obtain a first loss value.

S402, the coded semantic information is respectively input to a Multi-head Self-orientation layer of the named entity recognition task module, and then input to a corresponding loss function for calculation, so as to obtain a second loss value.

It should be noted that the Multi-head Self-orientation mechanism further uses different Self-orientation mechanisms to obtain enhanced semantic vectors of each word in the text in different semantic spaces, and linearly combines a plurality of enhanced semantic vectors of each word, so as to obtain a final enhanced semantic vector with the same length as the original word vector. The Self-orientation mechanism is characterized in that the influence of each character of a sequence on the upper and lower characters of the sequence is different, the contribution of each word to semantic information of the sequence is also different, and a new vector is generated by weighting and fusing the semantic vector information of all characters in the sequence with the character vector in the original input sequence through the Self-orientation mechanism, namely the original semantic information is enhanced.

And S403, performing maleic language recognition model training according to the loss value and the preset training times, and obtaining a trained maleic language recognition model. It is understood that the preset training times can be set according to the requirement, and are not limited herein.

Specifically, in the present embodiment, the loss function is a softmax function.

Loss in the formula_POSLoss value, Loss, of part-of-speech tag_NERIs a loss value for the named entity recognition task.

Because the labeling of the same word in different tasks may depend on different information, a Multi-head Self-orientation layer is respectively connected to the different tasks on the Bi-LSTM model, and then a loss function is added, so that the performance of the named entity recognition task is effectively improved. And because the two tasks are alternately trained in each iteration, the training speed of the model is improved, and the accuracy of named entity recognition is also improved.

Fig. 4 is a schematic flow chart of a method for identifying malay according to an embodiment of the present invention.

The embodiment of the invention correspondingly provides a method for identifying Malay, which comprises the following steps:

s1, acquiring a trained Malay recognition model; the trained maleic language recognition model is trained through maleic language sentences containing the same word with different parts of speech;

and S2, recognizing the to-be-processed maleic sentence according to the trained maleic recognition model to obtain the named entity corresponding to the to-be-processed maleic sentence.

It should be noted that the training method of the maleic language recognition model includes:

S20, acquiring semantic information corresponding to the to-be-trained maleic sentence according to a preset maleic information extraction model; the feature information comprises word vector feature information and text feature information;

and S30, encoding the semantic information according to a preset Bi-LSTM model.

The embodiment of the invention correspondingly provides a method for identifying the Malay, and the trained Malay identification model can identify the Malay sentences to obtain the corresponding named entities, so that the Malay sentences can be well identified in different fields, and the Malay identification accuracy is improved.

Fig. 5 is a schematic structural diagram of a training apparatus for a maleic language identification model according to an embodiment of the present invention.

The embodiment of the invention correspondingly provides a training device of a Malay recognition model, which comprises:

the first acquiring module 10 is configured to acquire at least one maleic sentence to be trained from a preset maleic language database; wherein the same word with different parts of speech is applied to the Malay sentence to be trained.

The information extraction module 20 is configured to obtain semantic information corresponding to the to-be-trained maleic sentence according to a preset maleic language information extraction model; the semantic information comprises word vector characteristic information and text characteristic information.

And the coding module 30 is used for coding the semantic information according to a preset Bi-LSTM model.

And the training module 40 is used for inputting the coded semantic information into the part-of-speech tagging task module and the named entity recognition task module respectively for training, and acquiring a trained maleic language recognition model.

The embodiment of the invention provides a training device of a maleic language recognition model, which comprises the steps of firstly obtaining maleic language sentences containing same words with different parts of speech, and then obtaining semantic information corresponding to the maleic language sentences according to a preset maleic language information extraction model; the feature information comprises word vector feature information and text feature information, the semantic information is coded according to a preset Bi-LSTM model, the coded semantic information is input to a part-of-speech tagging task module and a named entity recognition task module respectively for training, a trained maleic language recognition model is obtained, and the trained maleic language recognition model can recognize maleic sentences to obtain corresponding named entities, so that the same word can be recognized well in different fields, and the accuracy of maleic language recognition is improved.

Fig. 6 is a schematic structural diagram of a maleic language recognition apparatus according to an embodiment of the present invention.

The embodiment of the invention provides a maleic language recognition device, which comprises:

the second acquisition module 1 is used for acquiring a trained maleic language recognition model; and the trained Malay recognition model is trained through Malay sentences containing the same word with different parts of speech.

And the recognition module 2 is used for recognizing the to-be-processed maleic sentence according to the trained maleic recognition model to obtain the named entity corresponding to the to-be-processed maleic sentence.

The embodiment of the invention correspondingly provides a maleic language recognition device, and the trained maleic language recognition model can recognize the maleic sentence to obtain the corresponding named entity, so that the same word can be recognized well in different fields, and the accuracy of maleic language recognition is improved.

Fig. 7 is a schematic diagram of training equipment for a maleic language recognition model according to an embodiment of the present invention. The training apparatus of the maleic language recognition model of this embodiment includes: a processor 11, a memory 12 and a computer program stored in said memory 12 and executable on said processor 11. The processor 11 implements the steps in the embodiments of the syntax error correction method for indonesia described above when executing the computer program. Alternatively, the processor 11 implements the functions of the modules/units in the above-described device embodiments when executing the computer program.

Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 12 and executed by the processor 11 to accomplish the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used for describing the execution process of the computer program in the syntax error correction device of the indonesia.

The training device of the Malay recognition model can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing devices. The training device of the Malay recognition model may include, but is not limited to, a processor, and a memory. It will be understood by those skilled in the art that the schematic diagram is merely an example of a training device for a maleic language recognition model, and does not constitute a limitation on the training device for a maleic language recognition model, and may include more or fewer components than those shown, or some components in combination, or different components, for example, the training device for a maleic language recognition model may further include an input-output device, a network access device, a bus, etc.

The Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like, the processor is a control center of the training apparatus of the Malay recognition model, and various interfaces and lines are used to connect various parts of the training apparatus of the entire Malay recognition model.

The memory may be used to store the computer programs and/or modules, and the processor may implement the various functions of the training apparatus of the Malay recognition model by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.

Wherein, the modules/units integrated by the training device of the Malay recognition model can be stored in a computer readable storage medium if the modules/units are realized in the form of software functional units and sold or used as independent products. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, etc.

It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims

1. A training method of a Malay recognition model is characterized by comprising the following steps:

coding the semantic information according to a preset Bi-LSTM model;

2. The method for training a maleic language identification model according to claim 1, wherein the obtaining semantic information corresponding to the maleic sentence to be trained according to a preset maleic language information extraction model specifically comprises:

3. The method for training a maleic language identification model according to claim 1, wherein said encoding the semantic information according to a preset Bi-LSTM model specifically comprises:

4. The method for training a maleic language identification model according to claim 1, wherein the step of inputting the encoded semantic information to a part-of-speech tagging task module and a named entity identification task module for training respectively to obtain the trained maleic language identification model specifically comprises the steps of:

respectively inputting the coded semantic information to a Multi-head Self-orientation layer of the part-of-speech tagging task module, and then inputting the semantic information to a corresponding loss function for calculation to obtain a first loss value;

5. A method for identifying Malay, comprising:

6. A training device for a Malay recognition model is characterized by comprising:

7. A device for identifying malay, comprising:

8. A training device for a Malay recognition model, comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the training method for the Malay recognition model according to any one of claims 1 to 4 when executing the computer program.

9. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the training method for a maleic language identification model according to any one of claims 1 to 4.