CN112883721A - Method and device for recognizing new words based on BERT pre-training model - Google Patents
- Publication number
- CN112883721A (application number CN202110165682.1A)
- Authority
- CN
- China
- Prior art keywords
- new
- word
- words
- shallow
- training model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
- G06F18/2414—Smoothing the distance, e.g. radial basis function networks [RBFN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention provides a method and a device for recognizing new words based on a BERT pre-training model, relating to the technical field of new word mining. The method comprises: obtaining corpus information, and performing word segmentation on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of new word candidates; inputting the candidates into a shallow network of a BERT pre-training model and outputting shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidates, and the shallow dense vectors are used for identifying the boundary information of the candidates; extracting discrete features of the candidates; and inputting the shallow dense vectors and the discrete features into a DNN binary classification model to identify correct new words. Word boundaries are determined through the shallow network of the BERT pre-training model, so that correct new words can be accurately identified.
Description
Technical Field
The invention relates to the technical field of new word mining, in particular to a new word recognition method and device based on a BERT pre-training model.
Background
With the rapid development of Internet science and technology, emerging words, namely "new words", are frequently coined. In current semantic recognition scenarios, the meaning of a sentence cannot be correctly recognized when a new word in the sentence cannot be accurately recognized.
Disclosure of Invention
The invention aims to provide a method and a device for recognizing new words based on a BERT pre-training model, which are used for determining word boundaries through a shallow network of the BERT pre-training model so as to accurately recognize correct new words.
In a first aspect, an embodiment of the present invention provides a new word recognition method based on a BERT pre-training model, including:
obtaining corpus information, and performing word segmentation processing on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of new word candidates;
inputting the new word candidates into a shallow network of a BERT pre-training model, and outputting shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidates, and the shallow dense vectors are used for identifying the boundary information of the candidates;
extracting discrete features of the new word candidates;
and inputting the shallow dense vectors and the discrete features into a DNN binary classification model, and identifying correct new words.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner of the first aspect, where the step of inputting the shallow dense vectors and the discrete features into a DNN binary classification model to identify correct new words includes:
inputting the shallow dense vectors and the discrete features into the DNN binary classification model;
judging whether each new word candidate is a correct real word according to an output result, wherein the output result comprises the probability that the candidate is a correct real word;
and if the probability that the candidate is a correct real word is greater than a preset probability value, the candidate is a correct real word.
With reference to the first aspect, an embodiment of the present invention provides a second possible implementation manner of the first aspect, where the method further includes:
and if the new word and the word are correct real words, adjusting the DNN two-classification model and the BERT pre-training model through the new word and the word feedback.
With reference to the first aspect, an embodiment of the present invention provides a third possible implementation manner of the first aspect, where the method further includes:
and performing semantic recognition on the corpus information after the new words and expressions are recognized.
With reference to the first aspect, an embodiment of the present invention provides a fourth possible implementation manner of the first aspect, where the step of performing word segmentation processing on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of new word candidates includes:
segmenting and filtering the corpus information through an N-Gram word segmentation algorithm to generate a plurality of new word candidates, wherein each candidate is a byte fragment of one of several preset byte lengths.
With reference to the first aspect, an embodiment of the present invention provides a fifth possible implementation manner of the first aspect, where the shallow network includes layers 2 and 3 of the BERT pre-training model.
With reference to the first aspect, an embodiment of the present invention provides a sixth possible implementation manner of the first aspect, where the discrete features include left and right information entropies, mutual information, and the tf-idf statistic.
In a second aspect, an embodiment of the present invention further provides a new word recognition apparatus based on a BERT pre-training model, including:
the acquisition module is used for acquiring the corpus information and performing word segmentation processing on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of new word candidates;
the output module is used for inputting the new word candidates into a shallow network of a BERT pre-training model and outputting shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidates, and the shallow dense vectors are used for identifying the boundary information of the candidates;
the extraction module is used for extracting discrete features of the new word candidates;
and the recognition module is used for inputting the shallow dense vectors and the discrete features into a DNN binary classification model and recognizing correct new words.
In a third aspect, an embodiment provides an electronic device, including a memory and a processor, where the memory stores a computer program operable on the processor, and the processor implements the steps of the method described in any one of the foregoing embodiments when executing the computer program.
In a fourth aspect, embodiments provide a machine-readable storage medium having stored thereon machine-executable instructions that, when invoked and executed by a processor, cause the processor to carry out the steps of the method of any preceding embodiment.
The embodiments of the invention provide a method and a device for recognizing new words based on a BERT pre-training model. Possible new words are first identified through an N-Gram word segmentation algorithm. A shallow network of a BERT pre-training model into which a bidirectional self-attention network is introduced then determines the syntactic feature vectors and lexical feature vectors of the candidates, from which their boundary information is obtained. Finally, the discrete features of the candidates and the shallow dense vectors comprising the syntactic and lexical feature vectors are input into a DNN binary classification model, which judges whether each candidate is actually a correct new word. New words are thereby accurately recognized, making semantic recognition applications more accurate.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a new word recognition method based on a BERT pre-training model according to an embodiment of the present invention;
fig. 2 is a schematic diagram of functional modules of a new word recognition apparatus based on a BERT pre-training model according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a hardware architecture of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the current semantic recognition process, new words cannot be accurately recognized, so the exact meaning of sentences containing them cannot be determined. Consider, for example, a sentence such as "Xiao Li is determined to become a slash youth", where "slash youth" is a recent coinage for a young person pursuing multiple careers. Word segmentation with an old-lexicon algorithm yields combinations such as "Xiao Li", "determined", "become", "slash", and "youth"; semantic recognition and analysis based on these combinations cannot capture the meaning of the sentence. Therefore, accurate recognition of new words is particularly important in semantic recognition applications.
New words are conventionally extracted from the discrete features of words alone. Candidates extracted in this way mix the genuine new word "slash youth" with many fragments that do not actually form words, such as substrings straddling "slash" or "youth" together with neighboring characters.
In the new word extraction task, whether a candidate is a word depends on its independence from the surrounding context in both directions. An RNN (recurrent neural network) models only the influence of previous time steps on the current state, so it is ill-suited here. Moreover, new word extraction must attend to long-range sentence information, while a CNN (convolutional neural network) typically captures only local information, so a CNN is not accurate enough either.
The inventors' research found that a self-attention network solves both the bidirectionality problem of language and the problem of extracting global information, and that the word embedding vectors it extracts better represent whether a candidate can stand alone as a new word. A BERT pre-training model with a bidirectional self-attention network is therefore introduced; compared with traditional networks such as the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN), the self-attention network has many advantages.
Based on this, the method and the device for recognizing the new words based on the BERT pre-training model provided by the embodiment of the invention determine the boundaries of the words through the shallow network of the BERT pre-training model, and further accurately recognize the correct new words.
To facilitate understanding of the embodiment, first, a detailed description is given to a new word recognition method based on a BERT pre-training model disclosed in the embodiment of the present invention.
Fig. 1 is a flowchart of a new word recognition method based on a BERT pre-training model according to an embodiment of the present invention.
Referring to fig. 1, the method includes the steps of:
and S102, obtaining the corpus information, and performing word segmentation processing on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of new word words.
The corpus information comprises at least one sentence, and each sentence is composed of words.
It should be noted that N-Gram is a word segmentation algorithm based on the language models used in large-vocabulary continuous speech recognition. Its basic idea is to slide a window of size N over the content of a text byte by byte, forming a sequence of byte fragments of length N. Each byte fragment (word) is called a gram. The occurrence frequencies of all grams are counted and filtered against a preset new word threshold to form a key gram list, i.e., the vector feature space of the text; each gram in the list is one feature vector dimension, and the list contains a plurality of new word candidates.
As a specific implementation manner of step S102, the corpus information is first segmented, through an N-Gram word segmentation algorithm, into fragments of various preset byte lengths; the fragments are then filtered, i.e., old words existing in the old lexicon and old lexical rules are removed, leaving a plurality of new word candidates, each a byte fragment of one of the preset byte lengths. Preferably, the maximum preset byte length does not exceed 4, i.e., the preset byte lengths range from 1 to 4. The example sentence above can thus be segmented into candidates such as "Xiao Li", "determined", "slash", "youth", and "slash youth". However, a segmented candidate may be a real new word or may not actually form a word, so, in order to accurately identify correct new words, the candidates are verified through the following steps.
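As an illustration, the sliding-window extraction and frequency filtering described above can be sketched as follows (a minimal sketch: the function name, the frequency threshold, and the toy corpus are assumptions for illustration, and the additional old-lexicon filtering step is omitted):

```python
from collections import Counter

def extract_candidates(text: str, max_n: int = 4, min_freq: int = 2) -> list[str]:
    """Slide windows of size 1..max_n over the text and keep grams whose
    frequency meets a preset threshold (the 'key gram list').
    In the full pipeline, grams already present in the old lexicon
    would additionally be filtered out."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Filter by the preset frequency threshold to form candidate new words.
    return [gram for gram, c in counts.items() if c >= min_freq]

corpus = "abcab"
candidates = extract_candidates(corpus, max_n=2, min_freq=2)
# "a", "b", and "ab" each occur at least twice in "abcab"
```

In practice the corpus would be the raw text of step S102 and the surviving grams become the candidates passed on to step S104.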
Step S104, inputting the new word candidates into a shallow network of a BERT pre-training model, and outputting shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidates, and the shallow dense vectors are used for identifying the boundary information of the candidates.
In some embodiments, the shallow network includes layers 2 and 3 of the BERT pre-trained model, referring to the lower network structure of the BERT pre-trained model.
The BERT pre-training model into which the bidirectional self-attention network is introduced can delineate the boundaries of the new word candidates, which makes it convenient to identify whether a candidate is a genuine word. It should be noted that introducing a bidirectional self-attention network into a BERT pre-training model is a technique that can be implemented by those skilled in the art.
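As a sketch of how a shallow dense vector might be read out of the lower layers, the following assumes HuggingFace-style hidden states (one array per layer, with index 0 being the embedding output) and mean-pools layers 2 and 3 over the candidate's token span; the pooling scheme and all names are illustrative assumptions, not details fixed by the patent:

```python
import numpy as np

def shallow_dense_vector(hidden_states, span, layers=(2, 3)):
    """hidden_states: list of [seq_len, hidden] arrays, one per layer
    (index 0 = embedding output, as in common BERT implementations).
    span: (start, end) token indices of the candidate word.
    Returns the concatenation of the mean-pooled shallow-layer vectors."""
    start, end = span
    pooled = [hidden_states[layer][start:end].mean(axis=0) for layer in layers]
    return np.concatenate(pooled)

# Synthetic hidden states standing in for a real BERT forward pass:
# 4 layers, 6 tokens, hidden size 8.
rng = np.random.default_rng(0)
states = [rng.normal(size=(6, 8)) for _ in range(4)]
vec = shallow_dense_vector(states, span=(1, 3))
# vec has dimension 2 * 8 = 16 (layers 2 and 3 concatenated)
```

With a real model, `hidden_states` would come from running BERT with hidden-state output enabled; only the pooling over layers 2 and 3 is shown here.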
Step S106, extracting discrete features of the new word candidates.
The discrete features include the left and right information entropies, mutual information, the tf-idf statistic, and the like.
As an alternative embodiment, the discrete features are extracted from the new word and term according to a statistical algorithm, and are used for characterizing some characteristics of the new word and term, so that the DNN binary classification model can accurately classify the new word and term.
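The left/right information entropy and mutual information named above can be computed along the following lines (a hedged sketch on a toy character string; a real implementation would run over the full corpus and also compute tf-idf):

```python
import math
from collections import Counter

def neighbor_entropy(corpus: str, word: str, side: str = "left") -> float:
    """Entropy of the characters adjacent to `word` in the corpus.
    High entropy on both sides suggests the candidate occurs in varied
    contexts, i.e., it can stand alone as an independent word."""
    neighbors = Counter()
    i = corpus.find(word)
    while i != -1:
        j = i - 1 if side == "left" else i + len(word)
        if 0 <= j < len(corpus):
            neighbors[corpus[j]] += 1
        i = corpus.find(word, i + 1)
    total = sum(neighbors.values())
    return -sum(c / total * math.log(c / total)
                for c in neighbors.values()) if total else 0.0

def mutual_information(corpus: str, word: str) -> float:
    """Pointwise mutual information between the two halves of a
    two-character candidate; high PMI suggests the parts cohere."""
    n = len(corpus)
    p_xy = corpus.count(word) / n
    p_x = corpus.count(word[0]) / n
    p_y = corpus.count(word[1]) / n
    return math.log(p_xy / (p_x * p_y)) if p_xy else float("-inf")
```

In the toy corpus `"aabab"`, the candidate `"ab"` has two distinct left neighbors (entropy log 2) and positive PMI, consistent with word-like behavior.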
Step S108, inputting the shallow dense vectors and the discrete features into the DNN binary classification model, and identifying correct new words.
In a preferred embodiment for practical application, possible new word candidates are identified through an N-Gram word segmentation algorithm; a shallow network of a BERT pre-training model into which a bidirectional self-attention network is introduced determines the syntactic feature vectors and lexical feature vectors of the candidates, from which their boundary information is obtained; and the discrete features of the candidates, together with the shallow dense vectors comprising the syntactic and lexical feature vectors, are input into a DNN binary classification model, which judges whether each candidate is actually a correct new word. New words are thus accurately identified, making semantic recognition applications more accurate.
In some embodiments, step S108 in the above embodiments may also be specifically implemented by the following steps, which specifically include:
step 1.1), inputting the shallow dense vectors and the discrete features into a DNN two-classification model.
The DNN (Deep Neural Network) binary classification model judges whether each input candidate is genuinely a correct word, based on the features and the boundary information of the candidate.
As an alternative embodiment, the DNN binary classification model may judge the authenticity of a single new word candidate per input, or judge the authenticity of each of several candidates input together as a batch.
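A minimal sketch of such a batch forward pass follows; the one-hidden-layer architecture, the 19-dimensional input (a 16-dimensional shallow dense vector plus 3 discrete features), and the random toy weights are all assumptions for illustration, since in practice the weights are learned:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dnn_forward(features, w1, b1, w2, b2):
    """One-hidden-layer binary classifier; features: [batch, dim].
    Returns the probability that each candidate is a correct real word."""
    h = np.maximum(0.0, features @ w1 + b1)   # ReLU hidden layer
    return sigmoid(h @ w2 + b2).ravel()       # one probability per candidate

# Toy weights and a batch of 3 candidates (16-dim shallow vector + 3 discrete features).
rng = np.random.default_rng(1)
x = rng.normal(size=(3, 19))
w1 = rng.normal(size=(19, 8)); b1 = np.zeros(8)
w2 = rng.normal(size=(8, 1));  b2 = np.zeros(1)
probs = dnn_forward(x, w1, b1, w2, b2)
# probs is a length-3 array of values strictly between 0 and 1
```

Each entry of `probs` is then compared against the preset probability value described in steps 1.2) and 1.3).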
Step 1.2), judging whether the new word candidate is a correct real word according to an output result, wherein the output result comprises the probability that the candidate is a correct real word.
Step 1.3), if the probability that the candidate is a correct real word is greater than the preset probability value, the candidate is a correct real word.
Here, the output result for each candidate includes its class and a corresponding probability. For example, the candidate "slash youth" may receive a probability of eighty percent of being a correct real word, while a fragment such as "lash youth" may receive only forty percent. The preset probability value for accepting a correct real word is set according to actual requirements or user customization; with a threshold of, say, seventy percent, "slash youth" exceeds the preset probability value and is identified as a correct real word, whereas "lash youth" is not a correct real new word.
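The thresholding of steps 1.2) and 1.3) reduces to a one-line comparison; the 0.7 threshold and the 0.8/0.4 probabilities below mirror the hypothetical figures used in this example:

```python
def is_correct_real_word(prob: float, threshold: float = 0.7) -> bool:
    """Accept a candidate as a correct real word if its predicted
    probability is greater than the preset probability value."""
    return prob > threshold

# Mirroring the example: a candidate at 0.8 passes, a fragment at 0.4 does not.
accepted = is_correct_real_word(0.8)   # True
rejected = is_correct_real_word(0.4)   # False
```

Note the strict inequality: a candidate whose probability exactly equals the preset value is not accepted.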
In some embodiments, in order to more accurately identify the sentence semantics including the new word, the method provided in the above embodiments further includes:
and 2.1) if the new word is a correct real word, adjusting the DNN two-classification model and the BERT pre-training model through the new word feedback.
In some embodiments, the lexicon, the DNN binary classification model, the BERT pre-training model, and so on are adjusted with the identified correct new words. This is equivalent to the new word becoming a conventional old word in the current lexicon: in semantic recognition applications the word can be correctly segmented and semantically recognized, and in subsequent new word recognition it no longer needs to be treated as a new word and re-verified. In another optional embodiment, the DNN binary classification model and the BERT pre-training model adjusted through new word feedback are applied to another new word recognition scenario whose lexicon may not yet contain the word. The word is then still extracted by the N-Gram word segmentation algorithm and input into the two models, which, having learned that the word is correct and real, output a higher probability value, so that the judgment of its correctness is more accurate.
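The lexicon side of this feedback loop can be sketched as follows (the function and data structures are illustrative assumptions; fine-tuning of the models themselves is not shown):

```python
def feed_back(new_word: str, lexicon: set[str], pending: list[str]) -> None:
    """Promote a verified new word into the regular lexicon so that later
    segmentation treats it as a known (old) word and skips re-verification."""
    lexicon.add(new_word)
    if new_word in pending:
        pending.remove(new_word)

lexicon = {"youth"}
pending = ["slash youth"]
feed_back("slash youth", lexicon, pending)
# "slash youth" is now in the lexicon and no longer a pending candidate
```

After this update, the old-lexicon filtering in step S102 would remove "slash youth" from future candidate lists.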
In some embodiments, the method provided in the above embodiments further comprises:
and 3,1) performing semantic recognition on the corpus information after the new words and phrases are recognized.
It can be understood that, in subsequent semantic recognition scenarios, the corpus information segmented with the verified correct new words supports more accurate semantic recognition.
As shown in fig. 2, an embodiment of the present invention provides a new word recognition apparatus based on a BERT pre-training model, including:
the acquisition module is used for acquiring the corpus information and performing word segmentation processing on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of new word candidates;
the output module is used for inputting the new word candidates into a shallow network of a BERT pre-training model and outputting shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidates, and the shallow dense vectors are used for identifying the boundary information of the candidates;
the extraction module is used for extracting discrete features of the new word candidates;
and the recognition module is used for inputting the shallow dense vectors and the discrete features into a DNN binary classification model and recognizing correct new words.
In some embodiments, the recognition module is further specifically configured to: input the shallow dense vectors and the discrete features into the DNN binary classification model; judge whether each new word candidate is a correct real word according to an output result, wherein the output result comprises the probability that the candidate is a correct real word; and, if the probability that the candidate is a correct real word is greater than the preset probability value, determine that the candidate is a correct real word.
In some embodiments, the apparatus further comprises an adjusting module configured to adjust the DNN binary classification model and the BERT pre-training model through feedback of the new word if the new word is a correct real word.
In some embodiments, the apparatus further includes a semantic recognition module configured to perform semantic recognition on the corpus information after the new words are recognized.
In some embodiments, the acquisition module is further specifically configured to segment and filter the corpus information through an N-Gram word segmentation algorithm to generate a plurality of new word candidates, wherein each candidate is a byte fragment of one of several preset byte lengths.
In some embodiments, the shallow network includes layers 2 and 3 of the BERT pre-trained model.
In some embodiments, the discrete features include left and right information entropies, mutual information, and the tf-idf statistic.
In this embodiment, the electronic device may be, but is not limited to, a computer device with analysis and processing capabilities, such as a personal computer (PC), a notebook computer, a monitoring device, or a server.
As an exemplary embodiment, referring to fig. 3, the electronic device 120 includes a communication interface 121, a processor 122, a memory 123, and a bus 124, wherein the processor 122, the communication interface 121, and the memory 123 are connected by the bus 124; the memory 123 is used for storing a computer program that supports the processor 122 in executing the new word recognition method based on the BERT pre-training model, and the processor 122 is configured to execute the program stored in the memory 123.
A machine-readable storage medium as referred to herein may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions and data. For example, the machine-readable storage medium may be a RAM (Random Access Memory), a volatile memory, a non-volatile memory, a flash memory, a storage drive (e.g., a hard drive), any type of storage disk (e.g., an optical disk or DVD), a similar storage medium, or a combination thereof.
The non-volatile medium may be a non-volatile memory, a flash memory, a storage drive (e.g., a hard drive), any type of storage disk (e.g., an optical disk or DVD), a similar non-volatile storage medium, or a combination thereof.
It can be understood that, for the specific operation method of each functional module in this embodiment, reference may be made to the detailed description of the corresponding step in the foregoing method embodiment, and no repeated description is provided herein.
The computer-readable storage medium provided in the embodiments of the present invention stores a computer program, and when executed, the computer program code may implement the new word recognition method based on the BERT pre-training model according to any of the above embodiments, and specific implementation may refer to method embodiments, which are not described herein again.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein.
Claims (10)
1. A new word recognition method based on a BERT pre-training model, characterized by comprising the following steps:
obtaining corpus information, and performing word segmentation on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of candidate new words;
inputting the candidate new words into a shallow network of a BERT pre-training model and outputting shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidate new words, and the shallow dense vectors are used for identifying boundary information of the candidate new words;
extracting discrete features of the candidate new words; and
inputting the shallow dense vectors and the discrete features into a DNN binary classification model to identify correct new words.
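Read end to end, the four claimed steps form a pipeline: segment, embed, extract discrete features, classify. The sketch below is a minimal runnable illustration only; `shallow_dense_vector`, `discrete_features`, and `dnn_classify` are hypothetical stand-ins (the patented method uses BERT's shallow layers and a trained DNN, not these toy stubs):

```python
def ngram_segment(corpus, n=2):
    # Step 1 sketch: N-Gram segmentation into fixed-length candidates
    return [corpus[i:i + n] for i in range(len(corpus) - n + 1)]

def shallow_dense_vector(candidate):
    # Step 2 stand-in: in the patent this comes from BERT's shallow layers
    return [float(len(candidate)), float(sum(ord(c) for c in candidate) % 7)]

def discrete_features(candidate, corpus):
    # Step 3 stand-in: raw frequency as a crude discrete feature
    return [float(corpus.count(candidate))]

def dnn_classify(dense_vec, feats, threshold=2.0):
    # Step 4 stand-in: the patent uses a trained DNN binary classifier
    return sum(dense_vec) + sum(feats) > threshold

def recognize_new_words(corpus):
    candidates = set(ngram_segment(corpus))
    return sorted(c for c in candidates
                  if dnn_classify(shallow_dense_vector(c),
                                  discrete_features(c, corpus)))
```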
2. The method for recognizing new words based on a BERT pre-training model according to claim 1, wherein the step of inputting the shallow dense vectors and the discrete features into a DNN binary classification model to identify correct new words comprises:
inputting the shallow dense vectors and the discrete features into the DNN binary classification model;
judging, according to an output result, whether a candidate new word is a correct real word, wherein the output result comprises the probability that the candidate new word is a correct real word; and
determining the candidate new word to be a correct real word if that probability is greater than a preset probability value.
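The decision rule of this claim (accept a candidate when the model's output probability exceeds a preset value) can be sketched with a single logistic unit standing in for the full DNN; the function names, weights, and the one-layer network are illustrative assumptions, not the patented implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score_candidate(dense_vec, discrete_feats, weights, bias=0.0):
    # Concatenate the shallow dense vector with the discrete features,
    # then apply one logistic unit as a stand-in for the full DNN.
    x = list(dense_vec) + list(discrete_feats)
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)

def is_correct_real_word(dense_vec, discrete_feats, weights, preset_prob=0.5):
    # Accept the candidate when the output probability exceeds the
    # preset probability value.
    return score_candidate(dense_vec, discrete_feats, weights) > preset_prob
```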
3. The method of claim 2, further comprising:
if the candidate new word is a correct real word, feeding it back to adjust the DNN binary classification model and the BERT pre-training model.
4. The method of claim 1, further comprising:
performing semantic recognition on the corpus information after the new words have been recognized.
5. The method for recognizing new words based on a BERT pre-training model according to claim 1, wherein the step of performing word segmentation on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of candidate new words comprises:
segmenting and filtering the corpus information through an N-Gram word segmentation algorithm to generate a plurality of candidate new words, wherein the candidate new words are byte fragments of a plurality of preset byte lengths.
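Segment-and-filter N-Gram candidate generation can be sketched as follows: slide windows of each preset length over the corpus, then filter out fragments too rare to be plausible words. The frequency cutoff is an assumed filtering criterion for illustration; the patent does not specify its filter:

```python
from collections import Counter

def ngram_candidates(corpus, min_n=2, max_n=4, min_freq=2):
    # Slide windows of each preset length over the corpus (segmentation),
    # then keep only fragments that occur at least min_freq times (filtering).
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(corpus) - n + 1):
            counts[corpus[i:i + n]] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}
```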
6. The method of claim 1, wherein the shallow network comprises layers 2 and 3 of the BERT pre-training model.
7. The method of claim 1, wherein the discrete features include left-right entropy, mutual information, and the tf-idf statistic.
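Left-right (branch) entropy and mutual information are standard new-word-discovery statistics; the formulas below are the textbook versions, not necessarily the patent's exact definitions. High branch entropy means the candidate's neighbors vary freely (a word boundary), and high PMI means its halves co-occur far more often than chance:

```python
import math
from collections import Counter

def branch_entropy(corpus, word, side="left"):
    # Entropy of the characters adjacent to each occurrence of `word`.
    neighbors = []
    start = corpus.find(word)
    while start != -1:
        if side == "left" and start > 0:
            neighbors.append(corpus[start - 1])
        if side == "right" and start + len(word) < len(corpus):
            neighbors.append(corpus[start + len(word)])
        start = corpus.find(word, start + 1)
    total = len(neighbors)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total)
                for c in Counter(neighbors).values())

def mutual_information(corpus, word):
    # Lowest pointwise mutual information over all binary splits of `word`.
    def prob(s):
        return corpus.count(s) / max(len(corpus) - len(s) + 1, 1)
    best = min(prob(word) / (prob(word[:k]) * prob(word[k:]))
               for k in range(1, len(word)))
    return math.log(best)
```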
8. A new word recognition device based on a BERT pre-training model, characterized by comprising:
an acquisition module, configured to obtain corpus information and perform word segmentation on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of candidate new words;
an output module, configured to input the candidate new words into a shallow network of a BERT pre-training model and output shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidate new words, and the shallow dense vectors are used for identifying boundary information of the candidate new words;
an extraction module, configured to extract discrete features of the candidate new words; and
a recognition module, configured to input the shallow dense vectors and the discrete features into a DNN binary classification model and identify correct new words.
9. An electronic device comprising a memory, a processor, and a program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 7 when executing the program.
10. A computer-readable storage medium, characterized in that a computer program is stored on the storage medium, and the computer program, when executed, implements the method of any one of claims 1 to 7.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2021100511149 | 2021-01-14 | ||
CN202110051114 | 2021-01-14 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112883721A true CN112883721A (en) | 2021-06-01 |
CN112883721B CN112883721B (en) | 2024-01-19 |
Family
ID=76055944
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110165682.1A Active CN112883721B (en) | 2021-01-14 | 2021-02-06 | New word recognition method and device based on BERT pre-training model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112883721B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113343688A (en) * | 2021-06-22 | 2021-09-03 | 南京星云数字技术有限公司 | Address similarity determination method and device and computer equipment |
CN114841155A (en) * | 2022-04-21 | 2022-08-02 | 科技日报社 | Intelligent theme content aggregation method and device, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111460162A (en) * | 2020-04-11 | 2020-07-28 | 科技日报社 | Text classification method and device, terminal equipment and computer readable storage medium |
CN111563143A (en) * | 2020-07-20 | 2020-08-21 | 上海二三四五网络科技有限公司 | Method and device for determining new words |
CN111581374A (en) * | 2020-05-09 | 2020-08-25 | 联想(北京)有限公司 | Text abstract obtaining method and device and electronic equipment |
CN111783419A (en) * | 2020-06-12 | 2020-10-16 | 上海东普信息科技有限公司 | Address similarity calculation method, device, equipment and storage medium |
CN112214601A (en) * | 2020-10-21 | 2021-01-12 | 厦门市美亚柏科信息股份有限公司 | Social short text sentiment classification method and device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112084337B (en) | Training method of text classification model, text classification method and equipment | |
US11769054B2 (en) | Deep-learning-based system and process for image recognition | |
US8233726B1 (en) | Image-domain script and language identification | |
CN110807314A (en) | Text emotion analysis model training method, device and equipment and readable storage medium | |
CN108229481B (en) | Screen content analysis method and device, computing equipment and storage medium | |
CN112784581B (en) | Text error correction method, device, medium and electronic equipment | |
CN110968725B (en) | Image content description information generation method, electronic device and storage medium | |
US11658989B1 (en) | Method and device for identifying unknown traffic data based dynamic network environment | |
CN112883721B (en) | New word recognition method and device based on BERT pre-training model | |
CN115617955B (en) | Hierarchical prediction model training method, punctuation symbol recovery method and device | |
CN114997169B (en) | Entity word recognition method and device, electronic equipment and readable storage medium | |
CN111444349A (en) | Information extraction method and device, computer equipment and storage medium | |
CN116304042A (en) | False news detection method based on multi-modal feature self-adaptive fusion | |
CN110414229B (en) | Operation command detection method, device, computer equipment and storage medium | |
CN114387602B (en) | Medical OCR data optimization model training method, optimization method and equipment | |
CN115858776A (en) | Variant text classification recognition method, system, storage medium and electronic equipment | |
CN114996451A (en) | Semantic category identification method and device, electronic equipment and readable storage medium | |
KR102331440B1 (en) | System for text recognition using neural network and its method | |
CN114758330A (en) | Text recognition method and device, electronic equipment and storage medium | |
CN115687607A (en) | Text label identification method and system | |
CN112668343A (en) | Text rewriting method, electronic device and storage device | |
CN116150379B (en) | Short message text classification method and device, electronic equipment and storage medium | |
CN112287669B (en) | Text processing method and device, computer equipment and storage medium | |
CN114519357B (en) | Natural language processing method and system based on machine learning | |
CN116894092B (en) | Text processing method, text processing device, electronic equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||