CN112883721A - Method and device for recognizing new words based on BERT pre-training model

Method and device for recognizing new words based on BERT pre-training model

Info

Publication number
CN112883721A
CN112883721A
Authority
CN
China
Prior art keywords: new, word, words, shallow, training model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110165682.1A
Other languages
Chinese (zh)
Other versions
CN112883721B (en)
Inventor
邵德奇
石聪
关培培
朱经南
赵诗阳
冯超
李腾飞
段治平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Science And Technology Daily
Original Assignee
Science And Technology Daily
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Science And Technology Daily
Publication of CN112883721A
Application granted
Publication of CN112883721B
Legal status: Active (Current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G06F18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The invention provides a method and a device for recognizing new words based on a BERT pre-training model, relating to the technical field of new word mining. The method comprises: obtaining corpus information, and performing word segmentation on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of candidate new words; inputting the candidate new words into a shallow network of a BERT pre-training model, and outputting shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidate new words, and the shallow dense vectors are used for identifying boundary information of the candidates; extracting discrete features of the candidate new words; and inputting the shallow dense vectors and the discrete features into a DNN binary classification model to identify correct new words. The boundaries of the words are determined through the shallow network of the BERT pre-training model, so that correct new words can be recognized accurately.

Description

Method and device for recognizing new words based on BERT pre-training model
Technical Field
The invention relates to the technical field of new word mining, in particular to a new word recognition method and device based on a BERT pre-training model.
Background
With the rapid development of Internet science and technology, emerging words, namely "new words", are constantly being coined. In current semantic recognition scenarios, the meaning of a sentence cannot be recognized correctly when a new word in the sentence cannot be recognized accurately.
Disclosure of Invention
The invention aims to provide a method and a device for recognizing new words based on a BERT pre-training model, which are used for determining word boundaries through a shallow network of the BERT pre-training model so as to accurately recognize correct new words.
In a first aspect, an embodiment of the present invention provides a new word recognition method based on a BERT pre-training model, including:
obtaining corpus information, and performing word segmentation on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of candidate new words;
inputting the candidate new words into a shallow network of a BERT pre-training model, and outputting shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidate new words, and the shallow dense vectors are used for identifying boundary information of the candidate new words;
extracting discrete features of the candidate new words;
and inputting the shallow dense vectors and the discrete features into a DNN binary classification model to identify correct new words.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner of the first aspect, where the step of inputting the shallow dense vectors and the discrete features into a DNN binary classification model to identify correct new words includes:
inputting the shallow dense vectors and the discrete features into the DNN binary classification model;
judging whether a candidate new word is a correct real word according to an output result, wherein the output result comprises the probability that the candidate new word is a correct real word;
and if the probability that the candidate new word is a correct real word is greater than a preset probability value, determining that the candidate new word is a correct real word.
With reference to the first aspect, an embodiment of the present invention provides a second possible implementation manner of the first aspect, where the method further includes:
and if the candidate new word is a correct real word, adjusting the DNN binary classification model and the BERT pre-training model through feedback of the new word.
With reference to the first aspect, an embodiment of the present invention provides a third possible implementation manner of the first aspect, where the method further includes:
and performing semantic recognition on the corpus information after the new words are recognized.
With reference to the first aspect, an embodiment of the present invention provides a fourth possible implementation manner of the first aspect, where the step of performing word segmentation on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of candidate new words includes:
segmenting and filtering the corpus information through the N-Gram word segmentation algorithm to generate a plurality of candidate new words, wherein the candidate new words are byte fragments of various preset byte lengths.
With reference to the first aspect, an embodiment of the present invention provides a fifth possible implementation manner of the first aspect, where the shallow network includes layer 2 and layer 3 of the BERT pre-training model.
With reference to the first aspect, an embodiment of the present invention provides a sixth possible implementation manner of the first aspect, where the discrete features include left and right information entropy, mutual information, and the tf-idf statistic.
In a second aspect, an embodiment of the present invention further provides a new word recognition apparatus based on a BERT pre-training model, including:
the acquisition module is used for acquiring corpus information and performing word segmentation on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of candidate new words;
the output module is used for inputting the candidate new words into a shallow network of a BERT pre-training model and outputting shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidate new words, and the shallow dense vectors are used for identifying boundary information of the candidate new words;
the extraction module is used for extracting discrete features of the candidate new words;
and the recognition module is used for inputting the shallow dense vectors and the discrete features into a DNN binary classification model and recognizing correct new words.
In a third aspect, an embodiment provides an electronic device, including a memory and a processor, where the memory stores a computer program operable on the processor, and the processor implements the steps of the method described in any one of the foregoing embodiments when executing the computer program.
In a fourth aspect, embodiments provide a machine-readable storage medium having stored thereon machine-executable instructions that, when invoked and executed by a processor, cause the processor to carry out the steps of the method of any preceding embodiment.
The embodiments of the invention provide a method and a device for recognizing new words based on a BERT pre-training model. Possible new words are first identified through an N-Gram word segmentation algorithm; a shallow network of a BERT pre-training model with a bidirectional self-attention network then determines the syntactic feature vectors and lexical feature vectors of the candidates, from which their boundary information is obtained; finally, the discrete features of each candidate and the shallow dense vector comprising the syntactic and lexical feature vectors are input into a DNN binary classification model to judge whether the candidate is actually a correct new word. In this way new words are recognized accurately, and semantic recognition applications become more accurate.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a new word recognition method based on a BERT pre-training model according to an embodiment of the present invention;
fig. 2 is a schematic diagram of functional modules of a new word recognition apparatus based on a BERT pre-training model according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a hardware architecture of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the current semantic recognition process, new words cannot be recognized accurately, so the exact meaning of sentences containing them cannot be recognized. For example, consider the sentence "Xiao Li has become a slash youth", where "slash youth" is a newly coined term (referring to a young person with several occupations at once). Segmentation with an old-word algorithm yields word combinations such as "Xiao Li", "become", "a", "slash" and "youth", and semantic analysis based on these combinations cannot recognize the meaning of the sentence accurately. Accurate recognition of new words is therefore particularly important in semantic recognition applications.
New word extraction is generally performed by extracting discrete statistical features of candidate strings. The candidates extracted in this way include fragments with wrong boundaries, for example a fragment that joins "Xiao Li" with the first character of the following verb, or a fragment that keeps only part of "slash youth"; such fragments do not actually form words.
In the new word extraction task, whether a string is a new word depends on its independence from the surrounding context. A recurrent neural network (RNN) only models how the state at one time step is influenced by the previous time step, so it is not well suited to this task. New word extraction also needs to attend to information across longer spans of a sentence, whereas a convolutional neural network (CNN) usually only attends to local information, so a CNN is not accurate enough either.
The inventors have found that a self-attention network handles both the bidirectional nature of language and the extraction of global information well, and that the word embedding vectors it produces better indicate whether a string can stand alone as a new word. A BERT pre-training model with a bidirectional self-attention network is therefore introduced; compared with traditional networks such as the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN), the self-attention network has many advantages.
Based on this, the method and the device for recognizing the new words based on the BERT pre-training model provided by the embodiment of the invention determine the boundaries of the words through the shallow network of the BERT pre-training model, and further accurately recognize the correct new words.
To facilitate understanding of the embodiments, a detailed description is first given of the new word recognition method based on a BERT pre-training model disclosed in an embodiment of the present invention.
Fig. 1 is a flowchart of a new word recognition method based on a BERT pre-training model according to an embodiment of the present invention.
Referring to fig. 1, the method includes the steps of:
and S102, obtaining the corpus information, and performing word segmentation processing on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of new word words.
The corpus information comprises at least one sentence, and each sentence is composed of words.
It should be noted that N-Gram is a word segmentation algorithm based on a large vocabulary continuous speech recognition language model, and the basic idea is to perform a sliding window operation with a size of N on the content in a text according to bytes, so as to form a byte fragment sequence with a length of N. Each byte segment (word) is called as a gram, the occurrence frequency of all the grams is counted, and filtering is performed according to a preset new word threshold value to form a key gram list, namely a vector feature space of the text, each gram in the list is a feature vector dimension, and the list comprises a plurality of new word words.
For example, the step S102 may further specifically include an implementation manner that, through an N-Gram word segmentation algorithm, the corpus information is segmented into words with various preset byte lengths, and then the words are filtered, that is, old words existing in an old lexicon and an old lexical method are filtered, and then a plurality of new word words are generated through filtering, where the new word words are byte segments with various preset byte lengths. Preferably, the maximum length of the preset byte does not exceed 4, i.e. the preset byte length comprises 1-4. Thus, the example sentence described above can be segmented into little lie, lie, one, italic cyan, italic young, and the like. However, the divided new words may be real new words or may not actually form words, and therefore, in order to accurately identify correct new words, the new words are verified through the following steps.
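For illustration only, the following Python sketch shows one way this candidate-generation step could be realized; the function name, the frequency threshold, and the small example lexicon are assumptions made for the example and are not taken from the patent.

```python
from collections import Counter

def ngram_candidates(sentences, old_lexicon, max_len=4, min_freq=1):
    """Slide an n-gram window (n = 1..max_len) over each sentence, count the
    frequency of every character fragment, and keep fragments that pass the
    frequency threshold and are not already in the old lexicon."""
    counts = Counter()
    for sent in sentences:
        for n in range(1, max_len + 1):
            for i in range(len(sent) - n + 1):
                counts[sent[i:i + n]] += 1
    return [(gram, freq) for gram, freq in counts.items()
            if freq >= min_freq and gram not in old_lexicon]

# Hypothetical usage on the example sentence; the lexicon contents are illustrative.
candidates = ngram_candidates(["小李成为了一名斜杠青年"],
                              old_lexicon={"成为", "了", "一名"})
```

Each fragment returned this way is only a candidate; the following steps decide whether it is a real word.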
Step S104: input the candidate new words into a shallow network of a BERT pre-training model and output shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidate new words, and the shallow dense vectors are used for identifying boundary information of the candidates.
In some embodiments, the shallow network includes layers 2 and 3 of the BERT pre-training model, i.e. the lower part of its network structure.
The BERT pre-training model with the bidirectional self-attention network can delimit the boundaries of the candidate new words, which makes it convenient to judge whether a candidate is a genuine word. It should be noted that introducing a bidirectional self-attention network into a BERT pre-training model is a technique that can be implemented by those skilled in the art.
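As a non-authoritative illustration of how shallow dense vectors might be obtained from layers 2 and 3, the sketch below uses the HuggingFace Transformers library; the choice of the bert-base-chinese checkpoint and of mean pooling over tokens are assumptions for the example, since the patent does not specify them.

```python
import torch
from transformers import BertModel, BertTokenizer

# A minimal sketch (not the patent's exact implementation): take the hidden
# states of layers 2 and 3 of a BERT model and mean-pool them into a single
# "shallow dense vector" for a candidate word.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)
model.eval()

def shallow_dense_vector(candidate: str) -> torch.Tensor:
    inputs = tokenizer(candidate, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states[0] is the embedding layer; [2] and [3] are layers 2 and 3.
    layer2, layer3 = outputs.hidden_states[2], outputs.hidden_states[3]
    return torch.cat([layer2.mean(dim=1), layer3.mean(dim=1)], dim=-1).squeeze(0)

vec = shallow_dense_vector("斜杠青年")  # shape: (2 * hidden_size,)
```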
Step S106: extract discrete features of the candidate new words.
The discrete features include left and right information entropy, mutual information, the tf-idf statistic, and the like.
As an optional embodiment, the discrete features are extracted from each candidate new word by statistical algorithms and characterize properties of the candidate, so that the DNN binary classification model can classify it accurately.
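The sketch below illustrates, under assumed definitions, two of the discrete features named above: left/right information entropy and (pointwise) mutual information; the tf-idf statistic can be computed with standard tooling. The function names and formulas are common formulations, not necessarily the exact ones used by the patent.

```python
import math
from collections import Counter

def branching_entropy(candidate, corpus, side="left"):
    """Entropy of the characters adjacent to the candidate (left or right).
    High entropy on both sides suggests the candidate appears in many
    contexts and can stand alone as a word."""
    neighbours = Counter()
    start = 0
    while True:
        idx = corpus.find(candidate, start)
        if idx == -1:
            break
        if side == "left" and idx > 0:
            neighbours[corpus[idx - 1]] += 1
        elif side == "right" and idx + len(candidate) < len(corpus):
            neighbours[corpus[idx + len(candidate)]] += 1
        start = idx + 1
    total = sum(neighbours.values())
    return -sum(c / total * math.log(c / total) for c in neighbours.values()) if total else 0.0

def mutual_information(candidate, char_freq, gram_freq, total_chars):
    """Pointwise mutual information between the candidate and the product of
    its single-character frequencies, used here as a cohesion measure."""
    p_gram = gram_freq[candidate] / total_chars
    p_chars = 1.0
    for ch in candidate:
        p_chars *= char_freq[ch] / total_chars
    return math.log(p_gram / p_chars) if p_chars > 0 else 0.0
```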
Step S108: input the shallow dense vectors and the discrete features into the DNN binary classification model and identify correct new words.
In a preferred embodiment for practical application, possible new words are identified through the N-Gram word segmentation algorithm; a shallow network of the BERT pre-training model with a bidirectional self-attention network determines the syntactic feature vectors and lexical feature vectors of the candidates, and hence their boundary information; the discrete features of each candidate and the shallow dense vector comprising the syntactic and lexical feature vectors are then input into the DNN binary classification model to judge whether the candidate is actually a correct new word. In this way new words are recognized accurately, and semantic recognition applications become more accurate.
In some embodiments, step S108 may be implemented by the following steps:
step 1.1), inputting the shallow dense vectors and the discrete features into a DNN two-classification model.
Wherein, the Deep Neural Networks (DNN) two-classification model judges the real correctness of the input new words based on the characteristics and the boundary information of the new words.
As an alternative embodiment, the DNN binary model may identify the authenticity of a single new word per input, or, alternatively, identify the authenticity of each of several new words per batch input as a whole.
And step 1.2), judging whether the new word is a correct real word or not according to an output result, wherein the output result comprises the probability that the new word is the correct real word.
And step 1.3), if the probability that the new word is the correct true word is greater than the preset probability value, the new word is the correct true word.
Here, the output result of each new word term includes its belonging category and corresponding probability label, for example, the new word term "slash youth" belongs to the correct true word category, the corresponding probability may be eighty percent, the new word term "slash youth" belongs to the correct true word category, the corresponding probability may be forty percent, and so on. And setting a preset probability value which accords with the correct real word according to actual requirements or user customization, wherein if seventy percent, namely the preset probability value is exceeded, the new word 'slash youth' is identified as the correct real word, and the new word 'slash youth' is not the correct real new word.
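For illustration, a minimal DNN binary classifier of the kind described above could look like the following PyTorch sketch; the layer sizes, the single hidden layer, and the 0.7 decision threshold are assumptions chosen to mirror the example, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class NewWordClassifier(nn.Module):
    """A small feed-forward (DNN) binary classifier over the concatenation of
    the shallow dense vector and the discrete features."""
    def __init__(self, dense_dim, discrete_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dense_dim + discrete_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, dense_vec, discrete_feats):
        x = torch.cat([dense_vec, discrete_feats], dim=-1)
        return torch.sigmoid(self.net(x)).squeeze(-1)  # probability of "real new word"

# Hypothetical usage with a 0.7 threshold, mirroring the example above;
# 1536 = 2 * 768 matches the concatenated layer-2/layer-3 vector sketched earlier.
clf = NewWordClassifier(dense_dim=1536, discrete_dim=3)
prob = clf(torch.randn(1, 1536), torch.randn(1, 3))
is_new_word = bool(prob.item() > 0.7)
```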
In some embodiments, in order to identify the semantics of sentences containing the new word more accurately, the method provided in the above embodiments further includes:
Step 2.1): if the candidate new word is a correct real word, adjust the DNN binary classification model and the BERT pre-training model through feedback of the new word.
In some embodiments, the lexicon, the DNN binary classification model, the BERT pre-training model and so on are adjusted with the identified correct new words. The new word then effectively becomes a regular old word in the current lexicon: it can be segmented correctly and understood in semantic recognition applications, and subsequent new word recognition no longer needs to treat it as a candidate and judge its correctness again. In another optional embodiment, the DNN binary classification model and the BERT pre-training model adjusted through this feedback are applied to another new word recognition scenario in which the word may not yet be stored in the lexicon; the word is still extracted as a candidate by the N-Gram segmentation algorithm and input into the models, but thanks to the earlier feedback the models recognize more reliably that it is correct and real, the probability in the output result is higher, and the judgment of its correctness is more accurate.
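One possible, assumed realization of this feedback step is sketched below: the verified word is added to the old lexicon and to the tokenizer vocabulary so that later runs treat it as an ordinary word, after which the models could be fine-tuned. The patent does not prescribe this particular mechanism.

```python
from transformers import BertModel, BertTokenizer

# Sketch of the feedback step (an assumption about one way to realize it):
# a verified new word is added to the lexicon and to the BERT tokenizer
# vocabulary, after which the BERT model and the DNN classifier can be
# fine-tuned so the word is handled as an ordinary old word.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

old_lexicon = {"成为", "了", "一名"}     # illustrative lexicon
verified_new_words = ["斜杠青年"]        # words the DNN classifier confirmed

old_lexicon.update(verified_new_words)   # future candidate generation skips them
num_added = tokenizer.add_tokens(verified_new_words)
if num_added:
    model.resize_token_embeddings(len(tokenizer))
# ... fine-tune the BERT model and the DNN binary classifier on the updated data ...
```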
In some embodiments, the method provided in the above embodiments further comprises:
and 3,1) performing semantic recognition on the corpus information after the new words and phrases are recognized.
It can be understood that in the subsequent semantic recognition scene, the corpus information divided by the real and correct new words and expressions can be subjected to more accurate semantic recognition.
As shown in fig. 2, an embodiment of the present invention provides a new word recognition apparatus based on a BERT pre-training model, including:
the acquisition module is used for acquiring the corpus information and performing word segmentation on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of candidate new words;
the output module is used for inputting the candidate new words into a shallow network of a BERT pre-training model and outputting shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidate new words, and the shallow dense vectors are used for identifying boundary information of the candidate new words;
the extraction module is used for extracting discrete features of the candidate new words;
and the recognition module is used for inputting the shallow dense vectors and the discrete features into a DNN binary classification model and recognizing correct new words.
In some embodiments, the recognition module is further specifically configured to input the shallow dense vectors and the discrete features into the DNN binary classification model; judge whether a candidate new word is a correct real word according to an output result, wherein the output result comprises the probability that the candidate is a correct real word; and, if the probability that the candidate is a correct real word is greater than a preset probability value, determine that the candidate is a correct real word.
In some embodiments, the apparatus further comprises an adjusting module configured to adjust the DNN binary classification model and the BERT pre-training model through feedback of the new word if the candidate new word is a correct real word.
In some embodiments, the apparatus further includes a semantic recognition module, configured to perform semantic recognition on the corpus information after the new words are recognized.
In some embodiments, the acquisition module is further specifically configured to segment and filter the corpus information through the N-Gram word segmentation algorithm to generate a plurality of candidate new words, where the candidate new words are byte fragments of various preset byte lengths.
In some embodiments, the shallow network includes layers 2 and 3 of the BERT pre-training model.
In some embodiments, the discrete features include left and right information entropy, mutual information, and the tf-idf statistic.
In this embodiment, the electronic device may be, but is not limited to, a computing device with analysis and processing capabilities, such as a personal computer (PC), a notebook computer, a monitoring device, or a server.
As an exemplary embodiment, referring to fig. 3, the electronic device 120 includes a communication interface 121, a processor 122, a memory 123 and a bus 124, wherein the processor 122, the communication interface 121 and the memory 123 are connected through the bus 124; the memory 123 is used for storing a computer program that supports the processor 122 in executing the new word recognition method described above, and the processor 122 is configured to execute the program stored in the memory 123.
A machine-readable storage medium as referred to herein may be any electronic, magnetic, optical or other physical storage device that can contain or store information such as executable instructions and data. For example, the machine-readable storage medium may be a RAM (Random Access Memory), a volatile memory, a non-volatile memory, a flash memory, a storage drive (e.g., a hard drive), any type of storage disk (e.g., an optical disk or DVD), a similar storage medium, or a combination thereof.
The non-volatile medium may be a non-volatile memory, a flash memory, a storage drive (e.g., a hard drive), any type of storage disk (e.g., an optical disk or DVD), a similar non-volatile storage medium, or a combination thereof.
It can be understood that, for the specific operation method of each functional module in this embodiment, reference may be made to the detailed description of the corresponding step in the foregoing method embodiment, and no repeated description is provided herein.
The computer-readable storage medium provided in the embodiments of the present invention stores a computer program, and when executed, the computer program code may implement the new word recognition method based on the BERT pre-training model according to any of the above embodiments, and specific implementation may refer to method embodiments, which are not described herein again.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein.

Claims (10)

1. A new word recognition method based on a BERT pre-training model, characterized by comprising the following steps:
obtaining corpus information, and performing word segmentation on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of candidate new words;
inputting the candidate new words into a shallow network of a BERT pre-training model, and outputting shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidate new words, and the shallow dense vectors are used for identifying boundary information of the candidate new words;
extracting discrete features of the candidate new words;
and inputting the shallow dense vectors and the discrete features into a DNN binary classification model to identify correct new words.
2. The new word recognition method based on a BERT pre-training model according to claim 1, wherein the step of inputting the shallow dense vectors and the discrete features into a DNN binary classification model to identify correct new words comprises:
inputting the shallow dense vectors and the discrete features into the DNN binary classification model;
judging whether a candidate new word is a correct real word according to an output result, wherein the output result comprises the probability that the candidate new word is a correct real word;
and if the probability that the candidate new word is a correct real word is greater than a preset probability value, determining that the candidate new word is a correct real word.
3. The method of claim 2, further comprising:
if the candidate new word is a correct real word, adjusting the DNN binary classification model and the BERT pre-training model through feedback of the new word.
4. The method of claim 1, further comprising:
performing semantic recognition on the corpus information after the new words are recognized.
5. The new word recognition method based on a BERT pre-training model according to claim 1, wherein the step of performing word segmentation on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of candidate new words comprises:
segmenting and filtering the corpus information through the N-Gram word segmentation algorithm to generate a plurality of candidate new words, wherein the candidate new words are byte fragments of various preset byte lengths.
6. The method of claim 1, wherein the shallow network comprises layers 2 and 3 of the BERT pre-training model.
7. The method of claim 1, wherein the discrete features include left and right information entropy, mutual information, and the tf-idf statistic.
8. A new word recognition device based on a BERT pre-training model is characterized by comprising:
the acquisition module is used for acquiring corpus information and performing word segmentation on the corpus information through an N-Gram word segmentation algorithm to obtain a plurality of candidate new words;
the output module is used for inputting the candidate new words into a shallow network of a BERT pre-training model and outputting shallow dense vectors, wherein a bidirectional self-attention network is introduced into the BERT pre-training model, the shallow dense vectors comprise syntactic feature vectors and lexical feature vectors of the candidate new words, and the shallow dense vectors are used for identifying boundary information of the candidate new words;
the extraction module is used for extracting discrete features of the candidate new words;
and the recognition module is used for inputting the shallow dense vectors and the discrete features into a DNN binary classification model and recognizing correct new words.
9. An electronic device comprising a memory, a processor, and a program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 7 when executing the program.
10. A computer-readable storage medium, characterized in that a computer program is stored in the readable storage medium, which computer program, when executed, implements the method of any of claims 1-7.
CN202110165682.1A 2021-01-14 2021-02-06 New word recognition method and device based on BERT pre-training model Active CN112883721B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021100511149 2021-01-14
CN202110051114 2021-01-14

Publications (2)

Publication Number Publication Date
CN112883721A (en) 2021-06-01
CN112883721B CN112883721B (en) 2024-01-19

Family

ID=76055944

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110165682.1A Active CN112883721B (en) 2021-01-14 2021-02-06 New word recognition method and device based on BERT pre-training model

Country Status (1)

Country Link
CN (1) CN112883721B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343688A (en) * 2021-06-22 2021-09-03 南京星云数字技术有限公司 Address similarity determination method and device and computer equipment
CN114841155A (en) * 2022-04-21 2022-08-02 科技日报社 Intelligent theme content aggregation method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460162A (en) * 2020-04-11 2020-07-28 科技日报社 Text classification method and device, terminal equipment and computer readable storage medium
CN111563143A (en) * 2020-07-20 2020-08-21 上海二三四五网络科技有限公司 Method and device for determining new words
CN111581374A (en) * 2020-05-09 2020-08-25 联想(北京)有限公司 Text abstract obtaining method and device and electronic equipment
CN111783419A (en) * 2020-06-12 2020-10-16 上海东普信息科技有限公司 Address similarity calculation method, device, equipment and storage medium
CN112214601A (en) * 2020-10-21 2021-01-12 厦门市美亚柏科信息股份有限公司 Social short text sentiment classification method and device and storage medium

Also Published As

Publication number Publication date
CN112883721B (en) 2024-01-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant