CN112883721A - Method and device for recognizing new words based on BERT pre-training model - Google Patents
- Publication number
- CN112883721A (application number CN202110165682.1A)
- Authority
- CN
- China
- Prior art keywords
- new
- word
- words
- shallow
- training model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
- G06F18/2414—Smoothing the distance, e.g. radial basis function networks [RBFN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention provides a method and a device for recognizing new words based on a BERT pre-training model, relating to the technical field of new word mining. The method comprises: obtaining corpus information, and performing word segmentation on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of new word candidates; inputting the candidates into a shallow network of a BERT pre-training model and outputting shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidates, and the shallow dense vectors are used for identifying the boundary information of the candidates; extracting discrete features of the candidates; and inputting the shallow dense vectors and the discrete features into a DNN binary classification model to identify correct new words. Word boundaries are determined through the shallow network of the BERT pre-training model, so that correct new words can be accurately identified.
Description
Technical Field
The invention relates to the technical field of new word mining, in particular to a new word recognition method and device based on a BERT pre-training model.
Background
With the rapid development of Internet science and technology, emerging words, namely "new words", are frequently coined. In current semantic recognition scenarios, the meaning of a sentence cannot be correctly recognized when a new word in the sentence cannot be accurately recognized.
Disclosure of Invention
The invention aims to provide a method and a device for recognizing new words based on a BERT pre-training model, which are used for determining word boundaries through a shallow network of the BERT pre-training model so as to accurately recognize correct new words.
In a first aspect, an embodiment of the present invention provides a new word recognition method based on a BERT pre-training model, including:
obtaining corpus information, and performing word segmentation processing on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of new word candidates;
inputting the new word candidates into a shallow network of a BERT pre-training model, and outputting shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidates, and the shallow dense vectors are used for identifying the boundary information of the candidates;
extracting discrete features of the new word candidates;
and inputting the shallow dense vectors and the discrete features into a DNN binary classification model, and identifying correct new words.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner of the first aspect, where the step of inputting the shallow dense vectors and the discrete features into a DNN binary classification model to identify correct new words includes:
inputting the shallow dense vectors and the discrete features into the DNN binary classification model;
judging whether each new word candidate is a correct real word according to an output result, wherein the output result comprises the probability that the candidate is a correct real word;
and if the probability that the candidate is a correct real word is greater than a preset probability value, the candidate is a correct real word.
With reference to the first aspect, an embodiment of the present invention provides a second possible implementation manner of the first aspect, where the method further includes:
and if the new word and the word are correct real words, adjusting the DNN two-classification model and the BERT pre-training model through the new word and the word feedback.
With reference to the first aspect, an embodiment of the present invention provides a third possible implementation manner of the first aspect, where the method further includes:
and performing semantic recognition on the corpus information after the new words and expressions are recognized.
With reference to the first aspect, an embodiment of the present invention provides a fourth possible implementation manner of the first aspect, where the step of performing word segmentation processing on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of new word candidates includes:
segmenting and filtering the corpus information through an N-Gram word segmentation algorithm to generate a plurality of new word candidates, wherein each candidate is a byte fragment of one of several preset byte lengths.
With reference to the first aspect, an embodiment of the present invention provides a fifth possible implementation manner of the first aspect, where the shallow network includes layers 2 and 3 of the BERT pre-training model.
With reference to the first aspect, an embodiment of the present invention provides a sixth possible implementation manner of the first aspect, where the discrete features include left and right information entropies, mutual information, and the tf-idf statistic.
In a second aspect, an embodiment of the present invention further provides a new word recognition apparatus based on a BERT pre-training model, including:
the acquisition module is used for acquiring the corpus information and performing word segmentation processing on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of new word candidates;
the output module is used for inputting the new word candidates into a shallow network of a BERT pre-training model and outputting shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidates, and the shallow dense vectors are used for identifying the boundary information of the candidates;
the extraction module is used for extracting discrete features of the new word candidates;
and the recognition module is used for inputting the shallow dense vectors and the discrete features into a DNN binary classification model and recognizing correct new words.
In a third aspect, an embodiment provides an electronic device, including a memory and a processor, where the memory stores a computer program operable on the processor, and the processor implements the steps of the method described in any one of the foregoing embodiments when executing the computer program.
In a fourth aspect, embodiments provide a machine-readable storage medium having stored thereon machine-executable instructions that, when invoked and executed by a processor, cause the processor to carry out the steps of the method of any preceding embodiment.
The embodiments of the invention provide a method and a device for recognizing new words based on a BERT pre-training model. Possible new words are first identified through an N-Gram word segmentation algorithm. A shallow network of a BERT pre-training model into which a bidirectional self-attention network is introduced then determines the syntactic feature vectors and lexical feature vectors of the candidates, from which their boundary information is obtained. Finally, the discrete features of the candidates and the shallow dense vectors comprising the syntactic and lexical feature vectors are input into a DNN binary classification model, which judges whether each candidate is actually a correct new word. New words are thereby accurately recognized, making semantic recognition applications more accurate.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a new word recognition method based on a BERT pre-training model according to an embodiment of the present invention;
fig. 2 is a schematic diagram of functional modules of a new word recognition apparatus based on a BERT pre-training model according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a hardware architecture of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the current semantic recognition process, new words cannot be accurately recognized, so the exact meaning of sentences containing them cannot be determined. Consider, for example, a sentence such as "Xiao Li is determined to become a slash youth", where "slash youth" is a recent coinage for a young person pursuing multiple careers. Word segmentation with an old-lexicon algorithm yields combinations such as "Xiao Li", "determined", "become", "slash", and "youth"; semantic recognition and analysis based on these combinations cannot capture the meaning of the sentence. Therefore, accurate recognition of new words is particularly important in semantic recognition applications.
New words are conventionally extracted from the discrete features of words alone. Candidates extracted in this way mix the genuine new word "slash youth" with many fragments that do not actually form words, such as substrings straddling "slash" or "youth" together with neighboring characters.
In the new word extraction task, whether a candidate is a word depends on its independence from the surrounding context in both directions. An RNN (recurrent neural network) models only the influence of previous time steps on the current state, so it is ill-suited here. Moreover, new word extraction must attend to long-range sentence information, while a CNN (convolutional neural network) typically captures only local information, so a CNN is not accurate enough either.
The inventors' research found that a self-attention network solves both the bidirectionality problem of language and the problem of extracting global information, and that the word embedding vectors it extracts better represent whether a candidate can stand alone as a new word. A BERT pre-training model with a bidirectional self-attention network is therefore introduced; compared with traditional networks such as the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN), the self-attention network has many advantages.
Based on this, the method and the device for recognizing the new words based on the BERT pre-training model provided by the embodiment of the invention determine the boundaries of the words through the shallow network of the BERT pre-training model, and further accurately recognize the correct new words.
To facilitate understanding of the embodiment, first, a detailed description is given to a new word recognition method based on a BERT pre-training model disclosed in the embodiment of the present invention.
Fig. 1 is a flowchart of a new word recognition method based on a BERT pre-training model according to an embodiment of the present invention.
Referring to fig. 1, the method includes the steps of:
and S102, obtaining the corpus information, and performing word segmentation processing on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of new word words.
The corpus information comprises at least one sentence, and each sentence is composed of words.
It should be noted that N-Gram is a word segmentation algorithm based on the language models used in large-vocabulary continuous speech recognition. Its basic idea is to slide a window of size N over the content of a text byte by byte, forming a sequence of byte fragments of length N. Each byte fragment (word) is called a gram. The occurrence frequencies of all grams are counted and filtered against a preset new word threshold to form a key gram list, i.e., the vector feature space of the text; each gram in the list is one feature vector dimension, and the list contains a plurality of new word candidates.
As a specific implementation manner of step S102, the corpus information is first segmented, through an N-Gram word segmentation algorithm, into fragments of various preset byte lengths; the fragments are then filtered, i.e., old words existing in the old lexicon and old lexical rules are removed, leaving a plurality of new word candidates, each a byte fragment of one of the preset byte lengths. Preferably, the maximum preset byte length does not exceed 4, i.e., the preset byte lengths range from 1 to 4. The example sentence above can thus be segmented into candidates such as "Xiao Li", "determined", "slash", "youth", and "slash youth". However, a segmented candidate may be a real new word or may not actually form a word, so, in order to accurately identify correct new words, the candidates are verified through the following steps.
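As an illustration, the sliding-window extraction and frequency filtering described above can be sketched as follows (a minimal sketch: the function name, the frequency threshold, and the toy corpus are assumptions for illustration, and the additional old-lexicon filtering step is omitted):

```python
from collections import Counter

def extract_candidates(text: str, max_n: int = 4, min_freq: int = 2) -> list[str]:
    """Slide windows of size 1..max_n over the text and keep grams whose
    frequency meets a preset threshold (the 'key gram list').
    In the full pipeline, grams already present in the old lexicon
    would additionally be filtered out."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Filter by the preset frequency threshold to form candidate new words.
    return [gram for gram, c in counts.items() if c >= min_freq]

corpus = "abcab"
candidates = extract_candidates(corpus, max_n=2, min_freq=2)
# "a", "b", and "ab" each occur at least twice in "abcab"
```

In practice the corpus would be the raw text of step S102 and the surviving grams become the candidates passed on to step S104.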
Step S104, inputting the new word candidates into a shallow network of a BERT pre-training model, and outputting shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidates, and the shallow dense vectors are used for identifying the boundary information of the candidates.
In some embodiments, the shallow network includes layers 2 and 3 of the BERT pre-trained model, referring to the lower network structure of the BERT pre-trained model.
The BERT pre-training model into which the bidirectional self-attention network is introduced can delineate the boundaries of the new word candidates, which makes it convenient to identify whether a candidate is a genuine word. It should be noted that introducing a bidirectional self-attention network into a BERT pre-training model is a technique that can be implemented by those skilled in the art.
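As a sketch of how a shallow dense vector might be read out of the lower layers, the following assumes HuggingFace-style hidden states (one array per layer, with index 0 being the embedding output) and mean-pools layers 2 and 3 over the candidate's token span; the pooling scheme and all names are illustrative assumptions, not details fixed by the patent:

```python
import numpy as np

def shallow_dense_vector(hidden_states, span, layers=(2, 3)):
    """hidden_states: list of [seq_len, hidden] arrays, one per layer
    (index 0 = embedding output, as in common BERT implementations).
    span: (start, end) token indices of the candidate word.
    Returns the concatenation of the mean-pooled shallow-layer vectors."""
    start, end = span
    pooled = [hidden_states[layer][start:end].mean(axis=0) for layer in layers]
    return np.concatenate(pooled)

# Synthetic hidden states standing in for a real BERT forward pass:
# 4 layers, 6 tokens, hidden size 8.
rng = np.random.default_rng(0)
states = [rng.normal(size=(6, 8)) for _ in range(4)]
vec = shallow_dense_vector(states, span=(1, 3))
# vec has dimension 2 * 8 = 16 (layers 2 and 3 concatenated)
```

With a real model, `hidden_states` would come from running BERT with hidden-state output enabled; only the pooling over layers 2 and 3 is shown here.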
Step S106, extracting discrete features of the new word candidates.
The discrete features include the left and right information entropies, mutual information, the tf-idf statistic, and the like.
As an alternative embodiment, the discrete features are extracted from the new word and term according to a statistical algorithm, and are used for characterizing some characteristics of the new word and term, so that the DNN binary classification model can accurately classify the new word and term.
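The left/right information entropy and mutual information named above can be computed along the following lines (a hedged sketch on a toy character string; a real implementation would run over the full corpus and also compute tf-idf):

```python
import math
from collections import Counter

def neighbor_entropy(corpus: str, word: str, side: str = "left") -> float:
    """Entropy of the characters adjacent to `word` in the corpus.
    High entropy on both sides suggests the candidate occurs in varied
    contexts, i.e., it can stand alone as an independent word."""
    neighbors = Counter()
    i = corpus.find(word)
    while i != -1:
        j = i - 1 if side == "left" else i + len(word)
        if 0 <= j < len(corpus):
            neighbors[corpus[j]] += 1
        i = corpus.find(word, i + 1)
    total = sum(neighbors.values())
    return -sum(c / total * math.log(c / total)
                for c in neighbors.values()) if total else 0.0

def mutual_information(corpus: str, word: str) -> float:
    """Pointwise mutual information between the two halves of a
    two-character candidate; high PMI suggests the parts cohere."""
    n = len(corpus)
    p_xy = corpus.count(word) / n
    p_x = corpus.count(word[0]) / n
    p_y = corpus.count(word[1]) / n
    return math.log(p_xy / (p_x * p_y)) if p_xy else float("-inf")
```

In the toy corpus `"aabab"`, the candidate `"ab"` has two distinct left neighbors (entropy log 2) and positive PMI, consistent with word-like behavior.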
Step S108, inputting the shallow dense vectors and the discrete features into the DNN binary classification model, and identifying correct new words.
In a preferred embodiment for practical application, possible new word candidates are identified through an N-Gram word segmentation algorithm; a shallow network of a BERT pre-training model into which a bidirectional self-attention network is introduced determines the syntactic feature vectors and lexical feature vectors of the candidates, from which their boundary information is obtained; and the discrete features of the candidates, together with the shallow dense vectors comprising the syntactic and lexical feature vectors, are input into a DNN binary classification model, which judges whether each candidate is actually a correct new word. New words are thus accurately identified, making semantic recognition applications more accurate.
In some embodiments, step S108 in the above embodiments may also be specifically implemented by the following steps, which specifically include:
step 1.1), inputting the shallow dense vectors and the discrete features into a DNN two-classification model.
The DNN (Deep Neural Network) binary classification model judges whether each input candidate is genuinely a correct word, based on the features and the boundary information of the candidate.
As an alternative embodiment, the DNN binary classification model may judge the authenticity of a single new word candidate per input, or judge the authenticity of each of several candidates input together as a batch.
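A minimal sketch of such a batch forward pass follows; the one-hidden-layer architecture, the 19-dimensional input (a 16-dimensional shallow dense vector plus 3 discrete features), and the random toy weights are all assumptions for illustration, since in practice the weights are learned:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dnn_forward(features, w1, b1, w2, b2):
    """One-hidden-layer binary classifier; features: [batch, dim].
    Returns the probability that each candidate is a correct real word."""
    h = np.maximum(0.0, features @ w1 + b1)   # ReLU hidden layer
    return sigmoid(h @ w2 + b2).ravel()       # one probability per candidate

# Toy weights and a batch of 3 candidates (16-dim shallow vector + 3 discrete features).
rng = np.random.default_rng(1)
x = rng.normal(size=(3, 19))
w1 = rng.normal(size=(19, 8)); b1 = np.zeros(8)
w2 = rng.normal(size=(8, 1));  b2 = np.zeros(1)
probs = dnn_forward(x, w1, b1, w2, b2)
# probs is a length-3 array of values strictly between 0 and 1
```

Each entry of `probs` is then compared against the preset probability value described in steps 1.2) and 1.3).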
Step 1.2), judging whether the new word candidate is a correct real word according to an output result, wherein the output result comprises the probability that the candidate is a correct real word.
Step 1.3), if the probability that the candidate is a correct real word is greater than the preset probability value, the candidate is a correct real word.
Here, the output result for each candidate includes its class and a corresponding probability. For example, the candidate "slash youth" may receive a probability of eighty percent of being a correct real word, while a fragment such as "lash youth" may receive only forty percent. The preset probability value for accepting a correct real word is set according to actual requirements or user customization; with a threshold of, say, seventy percent, "slash youth" exceeds the preset probability value and is identified as a correct real word, whereas "lash youth" is not a correct real new word.
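The thresholding of steps 1.2) and 1.3) reduces to a one-line comparison; the 0.7 threshold and the 0.8/0.4 probabilities below mirror the hypothetical figures used in this example:

```python
def is_correct_real_word(prob: float, threshold: float = 0.7) -> bool:
    """Accept a candidate as a correct real word if its predicted
    probability is greater than the preset probability value."""
    return prob > threshold

# Mirroring the example: a candidate at 0.8 passes, a fragment at 0.4 does not.
accepted = is_correct_real_word(0.8)   # True
rejected = is_correct_real_word(0.4)   # False
```

Note the strict inequality: a candidate whose probability exactly equals the preset value is not accepted.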
In some embodiments, in order to more accurately identify the sentence semantics including the new word, the method provided in the above embodiments further includes:
and 2.1) if the new word is a correct real word, adjusting the DNN two-classification model and the BERT pre-training model through the new word feedback.
In some embodiments, the lexicon, the DNN binary classification model, the BERT pre-training model, and so on are adjusted with the identified correct new words. This is equivalent to the new word becoming a conventional old word in the current lexicon: in semantic recognition applications the word can be correctly segmented and semantically recognized, and in subsequent new word recognition it no longer needs to be treated as a new word and re-verified. In another optional embodiment, the DNN binary classification model and the BERT pre-training model adjusted through new word feedback are applied to another new word recognition scenario whose lexicon may not yet contain the word. The word is then still extracted by the N-Gram word segmentation algorithm and input into the two models, which, having learned that the word is correct and real, output a higher probability value, so that the judgment of its correctness is more accurate.
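The lexicon side of this feedback loop can be sketched as follows (the function and data structures are illustrative assumptions; fine-tuning of the models themselves is not shown):

```python
def feed_back(new_word: str, lexicon: set[str], pending: list[str]) -> None:
    """Promote a verified new word into the regular lexicon so that later
    segmentation treats it as a known (old) word and skips re-verification."""
    lexicon.add(new_word)
    if new_word in pending:
        pending.remove(new_word)

lexicon = {"youth"}
pending = ["slash youth"]
feed_back("slash youth", lexicon, pending)
# "slash youth" is now in the lexicon and no longer a pending candidate
```

After this update, the old-lexicon filtering in step S102 would remove "slash youth" from future candidate lists.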
In some embodiments, the method provided in the above embodiments further comprises:
and 3,1) performing semantic recognition on the corpus information after the new words and phrases are recognized.
It can be understood that, in subsequent semantic recognition scenarios, the corpus information segmented with the verified correct new words supports more accurate semantic recognition.
As shown in fig. 2, an embodiment of the present invention provides a new word recognition apparatus based on a BERT pre-training model, including:
the acquisition module is used for acquiring the corpus information and performing word segmentation processing on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of new word candidates;
the output module is used for inputting the new word candidates into a shallow network of a BERT pre-training model and outputting shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidates, and the shallow dense vectors are used for identifying the boundary information of the candidates;
the extraction module is used for extracting discrete features of the new word candidates;
and the recognition module is used for inputting the shallow dense vectors and the discrete features into a DNN binary classification model and recognizing correct new words.
In some embodiments, the recognition module is further specifically configured to: input the shallow dense vectors and the discrete features into the DNN binary classification model; judge whether each new word candidate is a correct real word according to an output result, wherein the output result comprises the probability that the candidate is a correct real word; and, if the probability that the candidate is a correct real word is greater than the preset probability value, determine that the candidate is a correct real word.
In some embodiments, the apparatus further comprises an adjusting module configured to adjust the DNN binary classification model and the BERT pre-training model through feedback of the new word if the new word is a correct real word.
In some embodiments, the apparatus further includes a semantic recognition module configured to perform semantic recognition on the corpus information after the new words are recognized.
In some embodiments, the acquisition module is further specifically configured to segment and filter the corpus information through an N-Gram word segmentation algorithm to generate a plurality of new word candidates, wherein each candidate is a byte fragment of one of several preset byte lengths.
In some embodiments, the shallow network includes layers 2 and 3 of the BERT pre-trained model.
In some embodiments, the discrete features include left and right information entropies, mutual information, and the tf-idf statistic.
In this embodiment, the electronic device may be, but is not limited to, a computer device with analysis and processing capabilities, such as a personal computer (PC), a notebook computer, a monitoring device, or a server.
As an exemplary embodiment, referring to fig. 3, the electronic device 120 includes a communication interface 121, a processor 122, a memory 123, and a bus 124, wherein the processor 122, the communication interface 121, and the memory 123 are connected by the bus 124; the memory 123 is used for storing a computer program that supports the processor 122 in executing the new word recognition method based on the BERT pre-training model, and the processor 122 is configured to execute the program stored in the memory 123.
A machine-readable storage medium as referred to herein may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions and data. For example, the machine-readable storage medium may be a RAM (Random Access Memory), a volatile memory, a non-volatile memory, a flash memory, a storage drive (e.g., a hard drive), any type of storage disk (e.g., an optical disk or DVD), a similar storage medium, or a combination thereof.
The non-volatile medium may be a non-volatile memory, a flash memory, a storage drive (e.g., a hard drive), any type of storage disk (e.g., an optical disk or DVD), a similar non-volatile storage medium, or a combination thereof.
It can be understood that, for the specific operation method of each functional module in this embodiment, reference may be made to the detailed description of the corresponding step in the foregoing method embodiment, and no repeated description is provided herein.
The computer-readable storage medium provided in the embodiments of the present invention stores a computer program, and when executed, the computer program code may implement the new word recognition method based on the BERT pre-training model according to any of the above embodiments, and specific implementation may refer to method embodiments, which are not described herein again.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein.
Claims (10)
1. A new word recognition method based on a BERT pre-training model, characterized by comprising the following steps:
obtaining corpus information, and performing word segmentation on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of candidate new words;
inputting the candidate new words into a shallow network of a BERT pre-training model and outputting shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidate new words, and the shallow dense vectors are used for identifying boundary information of the candidate new words;
extracting discrete features of the candidate new words; and
inputting the shallow dense vectors and the discrete features into a DNN binary classification model to identify correct new words.
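Read end to end, the four claimed steps form a pipeline: segment, embed, extract discrete features, classify. The sketch below is a minimal runnable illustration only; `shallow_dense_vector`, `discrete_features`, and `dnn_classify` are hypothetical stand-ins (the patented method uses BERT's shallow layers and a trained DNN, not these toy stubs):

```python
def ngram_segment(corpus, n=2):
    # Step 1 sketch: N-Gram segmentation into fixed-length candidates
    return [corpus[i:i + n] for i in range(len(corpus) - n + 1)]

def shallow_dense_vector(candidate):
    # Step 2 stand-in: in the patent this comes from BERT's shallow layers
    return [float(len(candidate)), float(sum(ord(c) for c in candidate) % 7)]

def discrete_features(candidate, corpus):
    # Step 3 stand-in: raw frequency as a crude discrete feature
    return [float(corpus.count(candidate))]

def dnn_classify(dense_vec, feats, threshold=2.0):
    # Step 4 stand-in: the patent uses a trained DNN binary classifier
    return sum(dense_vec) + sum(feats) > threshold

def recognize_new_words(corpus):
    candidates = set(ngram_segment(corpus))
    return sorted(c for c in candidates
                  if dnn_classify(shallow_dense_vector(c),
                                  discrete_features(c, corpus)))
```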
2. The method for recognizing new words based on a BERT pre-training model according to claim 1, wherein the step of inputting the shallow dense vectors and the discrete features into a DNN binary classification model to identify correct new words comprises:
inputting the shallow dense vectors and the discrete features into the DNN binary classification model;
judging, according to an output result, whether a candidate new word is a correct real word, wherein the output result comprises the probability that the candidate new word is a correct real word; and
determining the candidate new word to be a correct real word if that probability is greater than a preset probability value.
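The decision rule of this claim (accept a candidate when the model's output probability exceeds a preset value) can be sketched with a single logistic unit standing in for the full DNN; the function names, weights, and the one-layer network are illustrative assumptions, not the patented implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score_candidate(dense_vec, discrete_feats, weights, bias=0.0):
    # Concatenate the shallow dense vector with the discrete features,
    # then apply one logistic unit as a stand-in for the full DNN.
    x = list(dense_vec) + list(discrete_feats)
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)

def is_correct_real_word(dense_vec, discrete_feats, weights, preset_prob=0.5):
    # Accept the candidate when the output probability exceeds the
    # preset probability value.
    return score_candidate(dense_vec, discrete_feats, weights) > preset_prob
```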
3. The method of claim 2, further comprising:
if the candidate new word is a correct real word, feeding it back to adjust the DNN binary classification model and the BERT pre-training model.
4. The method of claim 1, further comprising:
performing semantic recognition on the corpus information after the new words have been recognized.
5. The method for recognizing new words based on a BERT pre-training model according to claim 1, wherein the step of performing word segmentation on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of candidate new words comprises:
segmenting and filtering the corpus information through an N-Gram word segmentation algorithm to generate a plurality of candidate new words, wherein the candidate new words are byte fragments of a plurality of preset byte lengths.
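Segment-and-filter N-Gram candidate generation can be sketched as follows: slide windows of each preset length over the corpus, then filter out fragments too rare to be plausible words. The frequency cutoff is an assumed filtering criterion for illustration; the patent does not specify its filter:

```python
from collections import Counter

def ngram_candidates(corpus, min_n=2, max_n=4, min_freq=2):
    # Slide windows of each preset length over the corpus (segmentation),
    # then keep only fragments that occur at least min_freq times (filtering).
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(corpus) - n + 1):
            counts[corpus[i:i + n]] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}
```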
6. The method of claim 1, wherein the shallow network comprises layers 2 and 3 of the BERT pre-training model.
7. The method of claim 1, wherein the discrete features include left-right entropy, mutual information, and the tf-idf statistic.
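Left-right (branch) entropy and mutual information are standard new-word-discovery statistics; the formulas below are the textbook versions, not necessarily the patent's exact definitions. High branch entropy means the candidate's neighbors vary freely (a word boundary), and high PMI means its halves co-occur far more often than chance:

```python
import math
from collections import Counter

def branch_entropy(corpus, word, side="left"):
    # Entropy of the characters adjacent to each occurrence of `word`.
    neighbors = []
    start = corpus.find(word)
    while start != -1:
        if side == "left" and start > 0:
            neighbors.append(corpus[start - 1])
        if side == "right" and start + len(word) < len(corpus):
            neighbors.append(corpus[start + len(word)])
        start = corpus.find(word, start + 1)
    total = len(neighbors)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total)
                for c in Counter(neighbors).values())

def mutual_information(corpus, word):
    # Lowest pointwise mutual information over all binary splits of `word`.
    def prob(s):
        return corpus.count(s) / max(len(corpus) - len(s) + 1, 1)
    best = min(prob(word) / (prob(word[:k]) * prob(word[k:]))
               for k in range(1, len(word)))
    return math.log(best)
```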
8. A new word recognition device based on a BERT pre-training model, characterized by comprising:
an acquisition module, configured to obtain corpus information and perform word segmentation on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of candidate new words;
an output module, configured to input the candidate new words into a shallow network of a BERT pre-training model and output shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidate new words, and the shallow dense vectors are used for identifying boundary information of the candidate new words;
an extraction module, configured to extract discrete features of the candidate new words; and
a recognition module, configured to input the shallow dense vectors and the discrete features into a DNN binary classification model and identify correct new words.
9. An electronic device comprising a memory, a processor, and a program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 7 when executing the program.
10. A computer-readable storage medium, characterized in that a computer program is stored on the storage medium, and the computer program, when executed, implements the method of any one of claims 1 to 7.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2021100511149 | 2021-01-14 | ||
CN202110051114 | 2021-01-14 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112883721A true CN112883721A (en) | 2021-06-01 |
CN112883721B CN112883721B (en) | 2024-01-19 |
Family
ID=76055944
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110165682.1A Active CN112883721B (en) | 2021-01-14 | 2021-02-06 | New word recognition method and device based on BERT pre-training model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112883721B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113343688A (en) * | 2021-06-22 | 2021-09-03 | 南京星云数字技术有限公司 | Address similarity determination method and device and computer equipment |
CN114841155A (en) * | 2022-04-21 | 2022-08-02 | 科技日报社 | Intelligent theme content aggregation method and device, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111460162A (en) * | 2020-04-11 | 2020-07-28 | 科技日报社 | Text classification method and device, terminal equipment and computer readable storage medium |
CN111563143A (en) * | 2020-07-20 | 2020-08-21 | 上海二三四五网络科技有限公司 | Method and device for determining new words |
CN111581374A (en) * | 2020-05-09 | 2020-08-25 | 联想(北京)有限公司 | Text abstract obtaining method and device and electronic equipment |
CN111783419A (en) * | 2020-06-12 | 2020-10-16 | 上海东普信息科技有限公司 | Address similarity calculation method, device, equipment and storage medium |
CN112214601A (en) * | 2020-10-21 | 2021-01-12 | 厦门市美亚柏科信息股份有限公司 | Social short text sentiment classification method and device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112084337B (en) | Training method of text classification model, text classification method and equipment | |
US11769054B2 (en) | Deep-learning-based system and process for image recognition | |
US8233726B1 (en) | Image-domain script and language identification | |
CN110807314A (en) | Text emotion analysis model training method, device and equipment and readable storage medium | |
CN108229481B (en) | Screen content analysis method and device, computing equipment and storage medium | |
CN112784581B (en) | Text error correction method, device, medium and electronic equipment | |
CN110968725B (en) | Image content description information generation method, electronic device and storage medium | |
US11658989B1 (en) | Method and device for identifying unknown traffic data based dynamic network environment | |
CN112883721B (en) | New word recognition method and device based on BERT pre-training model | |
CN115617955B (en) | Hierarchical prediction model training method, punctuation symbol recovery method and device | |
CN114997169B (en) | Entity word recognition method and device, electronic equipment and readable storage medium | |
CN111444349A (en) | Information extraction method and device, computer equipment and storage medium | |
CN116304042A (en) | False news detection method based on multi-modal feature self-adaptive fusion | |
CN110414229B (en) | Operation command detection method, device, computer equipment and storage medium | |
CN114387602B (en) | Medical OCR data optimization model training method, optimization method and equipment | |
CN115858776A (en) | Variant text classification recognition method, system, storage medium and electronic equipment | |
CN114996451A (en) | Semantic category identification method and device, electronic equipment and readable storage medium | |
KR102331440B1 (en) | System for text recognition using neural network and its method | |
CN114758330A (en) | Text recognition method and device, electronic equipment and storage medium | |
CN115687607A (en) | Text label identification method and system | |
CN112668343A (en) | Text rewriting method, electronic device and storage device | |
CN116150379B (en) | Short message text classification method and device, electronic equipment and storage medium | |
CN112287669B (en) | Text processing method and device, computer equipment and storage medium | |
CN114519357B (en) | Natural language processing method and system based on machine learning | |
CN116894092B (en) | Text processing method, text processing device, electronic equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||