CN111563143B - Method and device for determining new words - Google Patents

Method and device for determining new words Download PDF

Info

Publication number
CN111563143B
CN111563143B (application CN202010696059.4A)
Authority
CN
China
Prior art keywords
word
words
determining
candidate words
original candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010696059.4A
Other languages
Chinese (zh)
Other versions
CN111563143A (en)
Inventor
刘凡平
沈振雷
陈慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai 2345 Network Technology Co ltd
Original Assignee
Shanghai 2345 Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai 2345 Network Technology Co ltd filed Critical Shanghai 2345 Network Technology Co ltd
Priority to CN202010696059.4A priority Critical patent/CN111563143B/en
Publication of CN111563143A publication Critical patent/CN111563143A/en
Application granted granted Critical
Publication of CN111563143B publication Critical patent/CN111563143B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for determining new words based on a deep neural network, comprising the following steps: a: generating a plurality of original candidate words based on an N-Gram algorithm and a text to be identified; b: training the original candidate words based on a BERT model and determining vectorized candidate words; c: outputting, based on a deep neural network, each vectorized candidate word as a neuron labeled {y1, y2}; d: matching the one or more original candidate words determined to be words against a database, and if an original candidate word does not exist in the database, determining it to be a new word. Based on a search target and range, the invention can determine at scale, through fully computerized big-data operation, the new words appearing in present-day society, and expand the word bank of the input method.

Description

Method and device for determining new words
Technical Field
The invention belongs to the field of computer technology application, and particularly relates to a method and a device for determining new words.
Background
With the continuous progress of society and the penetration of the internet into people's daily lives, communication between people is no longer limited to face-to-face exchange; much of it now takes place, often more effectively, over the network. In such a diversified and fast-paced modern society, events large and small occur at every moment, and new words are a by-product of this modern development that make communication more effective and interesting. Examples appearing in recent years include 'jiongattitude', 'kaoyou', 'pigeon' and 'old driver'; as such words become widely used in communication, the meanings and scenarios they describe gradually gain broad acceptance.
Third-party platforms and systems, however, often need to adapt to users' habits and interests in order to provide high-quality services. With the rapid development of the internet, new words emerge endlessly, and the inability to recognize some of them frequently causes trouble for these platforms and their users. How to keep pace with the new words of modern society has therefore become a technical problem that merchants must solve, and how to obtain recently coined words in large quantity and with high accuracy is the most pressing technical problem at present.
New-word discovery is generally considered from the perspectives of degree of freedom and degree of solidification: a candidate should appear in a relatively rich set of contexts, while its interior should be relatively stable, i.e., its internal solidification should be comparatively high. At present there is no technical scheme that solves these problems; specifically, there is no method and device for determining new words.
Disclosure of Invention
In view of the technical defects in the prior art, an object of the present invention is to provide a method and an apparatus for determining new words. According to one aspect of the present invention, a method for determining new words based on a deep neural network is provided, comprising the following steps:
a: generating a plurality of original candidate words based on an N-Gram algorithm and a text to be identified;
b: training the original candidate words based on a BERT model, and determining vectorized candidate words;
c: outputting, based on a deep neural network, each of the plurality of vectorized candidate words as a neuron labeled {y1, y2}, wherein,
when y1 is 1 and y2 is 0, the original candidate word corresponding to the vectorized candidate word is determined to be a word, and when y1 is 0 and y2 is 1, the original candidate word corresponding to the vectorized candidate word is determined not to be a word;
d: matching the one or more original candidate words determined to be words against a database, and if an original candidate word does not exist in the database, determining it to be a new word.
Preferably, in step a, the text content is determined as the text to be identified by one of the following:
-a byte stream;
-a character stream; or
-a word stream.
Preferably, in step a, the original candidate words generated based on the N-Gram algorithm are determined by:
a1: performing a sliding-window operation of size N on the text to be identified to form character strings of length N, wherein each such character string is called a gram, 1 < N < M, and M is the number of characters of the original candidate word;
a2: determining all character strings of length N as original candidate words.
Preferably, in step b, the BERT model is determined from a large amount of text and based on the characters, the semantic information of the characters, and the position information of the characters.
Preferably, in step b, each vectorized candidate word is a 768-dimensional vector.
Preferably, in step c, the deep neural network model is determined by: training the deep neural network model on words corresponding to positive-example feature vectors and non-words corresponding to negative-example feature vectors in equal proportion, and tuning the model parameters through a back-propagation algorithm so that the deep neural network model acquires the ability to discriminate words,
wherein a positive-example feature vector corresponds to a neuron labeled {1, 0} and a negative-example feature vector corresponds to a neuron labeled {0, 1}.
Preferably, the back-propagation algorithm tunes the model parameters by:

w_new = w_old − η · ∂E/∂w

wherein w_new represents the new weight, w_old represents the weight in the previous iteration, η represents the learning rate, and ∂E/∂w represents the back-propagated error gradient with respect to the weight.
Preferably, in step c, the plurality of vectorized candidate words are output by:

L = −[a · log ŷ + (1 − a) · log(1 − ŷ)]

wherein ŷ is the predicted output value, a is the desired output, and L is the cross-entropy loss function value.
Preferably, the database is a standard lexicon.
According to another aspect of the present invention, there is provided an apparatus for determining new words based on a deep neural network using the above determination method, comprising:
a first processing device: generating a plurality of original candidate words based on an N-Gram algorithm and a text to be identified;
a first determining device: training the plurality of original candidate words based on a BERT model and determining a plurality of vectorized candidate words;
a second processing device: outputting, based on a deep neural network, each of the plurality of vectorized candidate words as a neuron labeled {y1, y2}; and
a second determining device: matching the one or more original candidate words determined to be words against a database, and if an original candidate word does not exist in the database, determining it to be a new word.
Preferably, the first processing device comprises:
a third processing device: performing a sliding-window operation of size N on the text to be identified to form character strings of length N;
a third determining device: determining all character strings of length N as original candidate words.
The invention discloses a method for determining new words, which is used for determining new words based on a deep neural network and comprises the following steps: a: generating a plurality of original candidate words based on an N-Gram algorithm and a text to be identified; b: training the original candidate words based on a BERT model, and determining vectorized candidate words; c: outputting a plurality of vectorized candidate words as labeled y based on a deep neural network1,y2H, wherein when y is1Is 1, y2When the word is 0, determining the original candidate word corresponding to the vectorization candidate word as a word, and when y is1Is 0, y2When the word number is 1, determining that the original candidate word corresponding to the vectorization candidate word is not a word; d: and matching one or more original candidate words determined as words in a database, and if the original candidate words do not exist in the database, determining one or more original candidate words as new words. The method combines an N-Gram algorithm and a BERT model to determine and vectorize words in a text, purposefully adopts a creative deep neural network to output neurons based on a judgment standard, finally matches candidate words determined as the words with all the words in a database, if no such words exist, the candidate words are new words, the method can greatly determine the new words appearing in the current society based on a search target and a search range through intelligent operation of big data of a computer in the whole process, and expands an input method word bank.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
fig. 1 is a schematic flow chart illustrating a method for determining new words according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a specific process of generating a plurality of original candidate words based on an N-Gram algorithm and a text to be identified according to a first embodiment of the present invention;
FIG. 3 is a block diagram of a device for determining new words according to another embodiment of the present invention; and
fig. 4 is a diagram illustrating a deep neural network-based neuron determination according to a second embodiment of the present invention.
Detailed Description
In order to present the technical scheme of the invention better and more clearly, the invention is further described below with reference to the accompanying drawings.
Fig. 1 shows a flowchart of a method for determining new words according to an embodiment of the present invention. With reference to fig. 1 and fig. 2, the technical implementation of the method is described in detail below. The invention combines an N-Gram algorithm and a BERT model to determine and vectorize the words in a text, and determines new words based on a deep neural network. Specifically, the method comprises the following steps:
first, step S101 is performed to generate a plurality of original candidate words based on the N-Gram algorithm and the text to be identified, and those skilled in the art understand that N-Gram (sometimes also referred to as N-Gram) is a very important concept in natural language processing, and in NLP, one can predict or evaluate whether a sentence is reasonable or not by using N-Gram based on a certain corpus. On the other hand, another role of the N-Gram is to evaluate the degree of difference between two strings. This is a commonly used approach in fuzzy matching. In the present invention, the N-Gram algorithm is a control method for effectively segmenting text content to obtain a plurality of required data, which is a currently common prior art, and a text to be identified is used for segmented text content, and the original candidate word is required data that needs to be determined whether to be a word or not.
Further, in step S101, the text content may be determined as the text to be identified by means of a byte stream, where a byte stream is a stream whose basic unit of transmission is the byte; in most cases the byte is the smallest basic unit of data. Correspondingly, the text content may also be determined as the text to be identified by means of a character stream, which processes two-byte Unicode characters: a character stream operates on characters, character arrays and strings, whereas a byte stream operates on bytes and byte arrays. A character stream arises, for example, when the Java virtual machine converts bytes into Unicode characters. A byte stream can carry any type of object, including binary objects, while a character stream can only handle characters and strings; the byte stream thus provides the ability to handle any type of IO operation, but it cannot directly process Unicode characters the way a character stream can.
In yet another very specific embodiment, the text content may also be determined as the text to be identified by means of a word stream, i.e., a stream consisting of several words, in order to find long new words composed of existing words. For example, 'Beijing' is a word and 'university' is a word; their combination 'Beijing university' may itself be a new word. Given the word stream 'Beijing university student recruitment notification', and supposing 'Beijing university' is a not-yet-discovered new word, such a combination can be found through the word stream. The idea of the N-Gram is then to judge candidates such as: 'Beijing university', 'university student recruitment', 'student recruitment notification', 'Beijing university student recruitment', 'university student recruitment notification', and 'Beijing university student recruitment notification'.
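As an illustrative sketch only (the patent prescribes no implementation; the function and variable names below are assumptions), word-level n-grams over such a word stream can be enumerated as follows:

def word_ngrams(words, n_min=2, n_max=None):
    # Enumerate multi-word candidates from a word stream by joining every
    # run of n adjacent words (Chinese words concatenate without spaces).
    n_max = n_max or len(words)
    for n in range(n_min, n_max + 1):          # candidate length in words
        for i in range(len(words) - n + 1):    # sliding window of size n
            yield "".join(words[i:i + n])

stream = ["Beijing", "university", "student recruitment", "notification"]
print(list(word_ngrams(stream)))  # the six candidates enumerated above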
Then step S102 is performed: the plurality of original candidate words are trained on a BERT model, and a plurality of vectorized candidate words are determined. BERT (Bidirectional Encoder Representations from Transformers) is a general pre-trained language representation model with very good performance. Current methods for obtaining a pre-trained representation model are mainly feature-based or fine-tuning approaches. The word-vector generation process adopts the BERT model, whose inputs are token embeddings, segment embeddings and position embeddings. The present invention mainly applies the feature-based method: the input characters are converted into vectors which are fed, in the forward and reverse directions respectively, into recurrent (LSTM) layers to obtain the corresponding outputs; L such layers are stacked, and the feature outputs of the two directions are linearly combined to obtain the pre-trained representation model.
Further, in step S102, the BERT model is determined from a large amount of text, based on the characters, their semantic information and their position information. This embodiment discloses one way of establishing the BERT model: since building a BERT model is itself prior art, the targeted contribution here is determining it from a large amount of text based on the characters, their semantics and their positions, such that similar words yield vectors of high similarity.
Those skilled in the art will appreciate that the BERT model employs the Transformer and, when processing a character, can also take the characters before and after it into account to derive its meaning in context. From new texts collected on the internet every day, the characters, their sentences and their position information are passed as input to the BERT model, which is iterated until convergence is stable; the trained BERT model is the character-embedding model. The BERT model randomly selects 15% of the tokens in the corpus; of these, 80% are replaced by the [MASK] token, 10% are randomly changed into another token, and the remaining 10% are kept unchanged. The model is then required to predict the selected tokens correctly, thereby achieving semantic understanding of the characters.
Further, in step S102, each vectorized candidate word is a 768-dimensional vector: in such an embodiment the N-Gram algorithm is applied first, and the BERT model is then used to generate the vectorized candidate words, each of 768 dimensions.
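As a minimal sketch of this step (the patent names no specific library; the use of the Hugging Face transformers package and the bert-base-chinese checkpoint below is an assumption), a candidate can be mapped to a 768-dimensional vector as follows:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
model = BertModel.from_pretrained("bert-base-chinese")          # hidden size 768
model.eval()

def vectorize(candidate: str) -> torch.Tensor:
    # Return one 768-dimensional vector per candidate word.
    inputs = tokenizer(candidate, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Take the [CLS] position of the last hidden layer as the candidate vector.
    return outputs.last_hidden_state[0, 0]  # shape: (768,)

print(vectorize("learning").shape)  # torch.Size([768])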
Then step S103 is entered: based on the deep neural network, each of the plurality of vectorized candidate words is output as a neuron labeled {y1, y2}. When y1 is 1 and y2 is 0, the original candidate word corresponding to the vectorized candidate word is determined to be a word; when y1 is 0 and y2 is 1, it is determined not to be a word. The vectorized representation of a candidate word is labeled by the deep neural network in One-Hot form. Combining steps S101 and S102, in a preferred embodiment the text to be identified is 'good learning'; the N-Gram algorithm determines one original candidate word 'learning', the BERT model generates its 768-dimensional vectorized candidate word, and the deep neural network outputs the vectorized candidate word as a neuron labeled {y1, y2}. Since the original candidate word is 'learning', the neuron labeled {1, 0} is output, and 'learning' is determined to be a word.
Further, in step S103, the deep neural network model is determined by: training the deep neural network model on words corresponding to positive-example feature vectors and non-words corresponding to negative-example feature vectors in equal proportion, and tuning the model parameters through a back-propagation algorithm so that the model acquires the ability to discriminate words, wherein a positive-example feature vector corresponds to a neuron labeled {1, 0} and a negative-example feature vector corresponds to a neuron labeled {0, 1}.
Those skilled in the art understand that the input is divided into positive and negative examples: a positive example is the feature vector of a genuine word, such as the 768-dimensional vector of 'computer'; a negative example is the vector of a non-word, such as the 768-dimensional feature vector of 'I eat', which is not a word. The model is trained on positive and negative examples in equal proportion, i.e., each accounts for 50% of the data; positive examples are labeled [1, 0] and negative examples [0, 1], and the model parameters are tuned through a back-propagation algorithm so that the model acquires the ability to discriminate words.
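A minimal sketch of preparing such a balanced, one-hot-labeled training set (the sampling scheme and names are assumptions for illustration):

import random

def build_training_set(word_vectors, nonword_vectors):
    # Balance positive (word) and negative (non-word) examples 50/50 and
    # attach one-hot labels: [1, 0] = word, [0, 1] = not a word.
    n = min(len(word_vectors), len(nonword_vectors))
    positives = [(v, [1, 0]) for v in random.sample(word_vectors, n)]
    negatives = [(v, [0, 1]) for v in random.sample(nonword_vectors, n)]
    dataset = positives + negatives
    random.shuffle(dataset)
    return dataset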
Further, the back-propagation algorithm tunes the model parameters by:

w_new = w_old − η · ∂E/∂w

wherein w_new represents the new weight, w_old represents the weight in the previous iteration, η represents the learning rate, and ∂E/∂w represents the back-propagated error gradient.
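As a one-function sketch, the update rule above is ordinary gradient descent (the variable names are assumptions):

def update_weight(w_old: float, grad: float, learning_rate: float = 0.01) -> float:
    # w_new = w_old - learning_rate * grad, where grad is the
    # back-propagated error gradient with respect to this weight.
    return w_old - learning_rate * grad

print(update_weight(0.5, 0.2))  # 0.498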
Further, fig. 4 shows a schematic diagram of determining neurons based on a deep neural network according to a second embodiment of the present invention. With the embodiment of fig. 4, those skilled in the art understand that word identification is performed by supervised learning, treating it as a classification problem whose network is a fully connected deep neural network. The final purpose of the invention is to judge, from an input composed of several characters, the probability that it forms a genuine word. As shown in fig. 4, the input of the deep neural network model is the 768-dimensional vector of a candidate word (i.e., n = 768 input neurons); the output is 2 neurons, where [1, 0] denotes a word and [0, 1] denotes not a word; the loss function of the model is the cross-entropy loss; and the optimization method is Adam.
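A sketch of a classifier matching the stated architecture (768-dimensional input, 2 output neurons, cross-entropy loss, Adam); writing it in PyTorch, and the hidden-layer size, are assumptions, since the patent specifies neither:

import torch
import torch.nn as nn

class WordClassifier(nn.Module):
    # Fully connected network: 768-dim candidate vector -> 2 logits.
    def __init__(self, hidden: int = 256):    # hidden size assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(768, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),              # class 0 = word, class 1 = not a word
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = WordClassifier()
criterion = nn.CrossEntropyLoss()              # cross-entropy loss, as stated
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(vectors: torch.Tensor, labels: torch.Tensor) -> float:
    # One supervised step: forward pass, back-propagate, adjust parameters.
    optimizer.zero_grad()
    loss = criterion(model(vectors), labels)   # labels are class indices
    loss.backward()
    optimizer.step()
    return loss.item()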
Further, in step S103, the plurality of vectorized candidate words are output by:

L = −[a · log ŷ + (1 − a) · log(1 − ŷ)]

wherein ŷ is the predicted output value, a is the desired output, and L is the cross-entropy loss function value. In a two-class problem model such as logistic regression or a neural network, the label of a real sample is 0 or 1, denoting the negative and positive classes respectively. The model ultimately outputs a probability value, usually via the Sigmoid function, reflecting the probability of predicting the positive class: the larger, the more likely. The Sigmoid function is:

g(s) = 1 / (1 + e^(−s))

where s is the output of the layer above, and it has the following characteristics: g(0) = 0.5; g(s) ≈ 1 when s ≫ 0; and g(s) ≈ 0 when s ≪ 0. Evidently g(s) maps the linear output of the preceding layer to a probability in [0, 1], and this g(s) is the model prediction output ŷ in the cross-entropy formula. The prediction output, i.e., the output of the Sigmoid function, characterizes the probability that the current sample's label is 1: ŷ = P(y = 1 | x); clearly the probability that the label is 0 is then 1 − ŷ = P(y = 0 | x). Integrating the two cases from the perspective of maximum likelihood, P(y | x) = ŷ^y · (1 − ŷ)^(1−y): when the true label y = 0 the first factor is 1 and the expression reduces to P(y = 0 | x) = 1 − ŷ, and when y = 1 the second factor is 1 and it reduces to P(y = 1 | x) = ŷ. We want the probability P(y | x) to be as large as possible. Introducing the log function, which does not affect monotonicity, the larger log P(y | x) the better, or equivalently the smaller −log P(y | x) the better; letting Loss = −log P(y | x), the loss function is obtained as:

L = −[y · log ŷ + (1 − y) · log(1 − ŷ)]

which is the formula above with the desired output a written as the true label y. Those skilled in the art will appreciate that the cross-entropy loss is very sensitive to differences in classification quality and quantizes them accurately.
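A small numeric sketch of the Sigmoid and cross-entropy computations above (plain Python; the names are assumptions):

import math

def sigmoid(s: float) -> float:
    # g(s) = 1 / (1 + e^(-s)): maps a logit to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-s))

def cross_entropy(y: int, y_hat: float) -> float:
    # L = -[y*log(y_hat) + (1 - y)*log(1 - y_hat)] for a single sample.
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

y_hat = sigmoid(2.0)            # ~0.881: the model is fairly confident
print(cross_entropy(1, y_hat))  # ~0.127: small loss when the label is 1
print(cross_entropy(0, y_hat))  # ~2.127: large loss when the label is 0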
Finally, step S104 is performed: the one or more original candidate words determined to be words are matched against a database; if an original candidate word is not in the database, it is determined to be a new word. Those skilled in the art understand that the database is a standard lexicon. Combining steps S101 to S104: first, the system prepares the standard lexicon and retrieves a large number of articles; original candidate words are generated per step S101, in which a plurality of original candidate words are produced with the N-Gram; the probability that a candidate word is a word is then judged with the deep neural network. For example, for the original candidate word 'good learning', a 768-dimensional vectorized candidate word is generated with the BERT model per step S102; the deep neural network then judges it per step S103, producing one of the two vectors [1, 0] and [0, 1], where [1, 0] denotes a word and [0, 1] denotes not a word. Further, in step S104, the word is looked up in the standard lexicon: if it exists there, no processing is performed; if it does not, it is marked as a new word and added to the standard lexicon.
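A sketch of this final matching step (representing the standard lexicon as an in-memory set is an assumption):

def find_new_words(words_judged_by_network, lexicon: set) -> list:
    # Keep candidates the network judged to be words but that are
    # absent from the standard lexicon, then add them to it (step S104).
    new_words = [w for w in words_judged_by_network if w not in lexicon]
    lexicon.update(new_words)
    return new_words

lexicon = {"learning", "computer"}
print(find_new_words(["learning", "good learning"], lexicon))  # ['good learning']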
In a preferred embodiment, after a new word is added to the standard lexicon, it is compared with the other words in the lexicon to determine the domain to which it belongs and to learn its meaning. For example, the new word 'good learning' is compared with the existing word 'learning' and the angle between their vectors is computed; the smaller the angle, the more nearly the same their domains and the more similar their senses, so the domain to which the new word belongs can be judged approximately.
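A sketch of the vector-angle comparison (cosine similarity over the 768-dimensional BERT vectors; the numpy usage is an assumption):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) between two candidate vectors: the closer to 1, the smaller
    # the angle, hence the nearer the domains and the more similar the senses.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. compare the new word "good learning" with the existing word "learning":
# sim = cosine_similarity(vectorize("good learning").numpy(),
#                         vectorize("learning").numpy())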
Fig. 2 shows the detailed flow of the first embodiment of the present invention, namely the specific flow of generating a plurality of original candidate words based on the N-Gram algorithm and the text to be identified; fig. 2 details step S101 of the present invention, in which the original candidate words generated based on the N-Gram algorithm are determined as follows:
First, step S1011 is entered: a sliding-window operation of size N is performed on the text to be identified to form character strings of length N, each of which is called a gram, where 1 < N < M and M is the number of characters of the original candidate word. Those skilled in the art understand that if the original candidate word has 8 characters, N may preferably take the values 2, 3, 4, 5, 6 and 7, so that candidate words are generated from bigrams, trigrams, 4-grams, 5-grams, 6-grams and 7-grams: an eight-character text thus yields seven bigrams, six trigrams, five 4-grams, four 5-grams, three 6-grams and two 7-grams, each adjacent substring being one candidate word. Further, each candidate word is judged by the word-identification procedure described above: if it is judged to be a word, existing words are filtered out, and a candidate that does not belong to the existing lexicon is discovered as a new word.
Finally, step S1012 is performed: all character strings of length N are determined as original candidate words; that is, in combination with step S1011, all candidate words generated from the bigrams, trigrams, 4-grams, 5-grams, 6-grams and 7-grams are determined as original candidate words.
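A minimal sketch of steps S1011 and S1012 together (the names are assumptions):

def char_ngrams(text: str) -> list:
    # Slide a window of every size N (1 < N < M, M = character count)
    # over the text and collect all length-N substrings as candidates.
    m = len(text)
    candidates = []
    for n in range(2, m):                   # 1 < N < M
        for i in range(len(text) - n + 1):  # window start positions
            candidates.append(text[i:i + n])
    return candidates

print(char_ngrams("abcdefgh"))  # all 2- to 7-grams of an 8-character text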
Those skilled in the art understand that the invention combines the N-Gram algorithm, the BERT model and the neural network to improve the discovery rate of new words. Specifically, the N-Gram improves the coverage of word discovery, BERT uncovers semantic relations and the connections between words, and the neural network, by virtue of its data-fitting classification ability, discovers new words efficiently. With the three combined, research data show about 40 new words discovered per day at an accuracy of 70-80%; neither any one of the three technologies alone nor any pairwise combination of them achieves the technical effect of the scheme recorded here. The three technical features are stepwise, ordered and closely connected, i.e., the invention realizes new-word discovery on the basis of their combination. Compared with the prior art, the method can determine at scale, through fully computerized big-data operation based on a search target and range, the new words appearing in present-day society, expand the word bank of the input method, and efficiently acquire a large number of new words with high probability.
More specifically, compared with the prior art, the combined N-Gram, BERT and neural-network scheme exploits the diversity and completeness with which the N-Gram algorithm extracts words from big data, so that new words are found and determined in big data without omission and more accurately.
In this application the BERT model incorporates the context of the vocabulary, understanding and locating the function of a word more comprehensively on the basis of the characters, their semantic information and their position information, and cooperates with the neural network to output the final new-word judgment: the vectorized candidate words are used as input for training the neural network, which outputs neurons labeled {y1, y2}. In such an embodiment, if the probability threshold for judging whether a candidate is a new word is set lower, more suspected new words are obtained but the overall accuracy may fall; if it is set higher, fewer suspected new words are obtained but the overall accuracy improves. More specifically, whatever the threshold, in a preferred embodiment the application can quickly find new words simply by feeding the latest web articles into the recorded technical scheme, without changing the model algorithm or the model parameters.
In combination with the above embodiments, the preparation of labeled data sets can be reduced. In the prior art, the data labeling required to preprocess new-word discovery often consumes substantial manpower and financial resources; research results show that roughly half of the labeled data can be dispensed with, reducing that consumption and improving the efficiency of new-word discovery.
Fig. 3 is a schematic block-connection diagram of an apparatus for determining new words according to another embodiment of the present invention. Those skilled in the art understand that the apparatus determines new words based on a deep neural network using the determination method above, and comprises the first processing device 1: generating a plurality of original candidate words based on the N-Gram algorithm and the text to be identified; for the working principle of the first processing device 1 refer to step S101, which is not repeated here.
Further, the apparatus further comprises the first determining device 2: training the plurality of original candidate words based on the BERT model and determining a plurality of vectorized candidate words; for the working principle of the first determining device 2 refer to step S102, not repeated here.
Further, the apparatus further comprises the second processing device 3: outputting, based on the deep neural network, each vectorized candidate word as a neuron labeled {y1, y2}; for the working principle of the second processing device 3 refer to step S103, not repeated here.
Further, the apparatus further comprises the second determining device 4: matching the one or more original candidate words determined to be words against a database and, if an original candidate word is not in the database, determining it to be a new word; for the working principle of the second determining device 4 refer to step S104, not repeated here.
Further, the first processing device 1 comprises the third processing device 11: performing a sliding-window operation of size N on the text to be identified to form character strings of length N; for the working principle of the third processing device 11 refer to step S1011, not repeated here.
Further, the first processing device 1 further comprises the third determining device 12: determining all character strings of length N as original candidate words; for the working principle of the third determining device 12 refer to step S1012, not repeated here.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims (10)

1. A method for determining a new word based on a deep neural network is characterized by comprising the following steps:
a: generating a plurality of original candidate words based on an N-Gram algorithm and a text to be identified;
b: determining a BERT model from a large amount of text and based on characters, the semantic information of the characters, and the position information of the characters, training the plurality of original candidate words by a feature-based method, and determining a plurality of vectorized candidate words;
c: outputting, based on a deep neural network, each of the plurality of vectorized candidate words as a neuron labeled {y1, y2}, wherein,
when y1 is 1 and y2 is 0, the original candidate word corresponding to the vectorized candidate word is determined to be a word, and when y1 is 0 and y2 is 1, the original candidate word corresponding to the vectorized candidate word is determined not to be a word;
d: matching the one or more original candidate words determined to be words against a database, and if an original candidate word does not exist in the database, determining it to be a new word.
2. The determination method according to claim 1, wherein in step a, text content is determined as the text to be identified by any one of the following:
-a byte stream;
-a character stream; or
-a word stream.
3. The determination method according to claim 1, wherein in step a, the original candidate words generated based on the N-Gram algorithm are determined by:
a1: performing a sliding-window operation of size N on the text to be identified to form character strings of length N, wherein each such character string is called a gram, 1 < N < M, and M is the number of characters of the original candidate word;
a2: determining all character strings of length N as original candidate words.
4. The method of claim 1, wherein in step b, the vectorized candidate word is a 768-dimensional vector.
5. The determination method according to claim 1, wherein in step c, the deep neural network model is determined by: training the deep neural network model on words corresponding to positive-example feature vectors and non-words corresponding to negative-example feature vectors in equal proportion, and tuning the model parameters through a back-propagation algorithm so that the deep neural network model acquires the ability to discriminate words,
wherein a positive-example feature vector corresponds to a neuron labeled {1, 0} and a negative-example feature vector corresponds to a neuron labeled {0, 1}.
6. The determination method according to claim 5, wherein the back-propagation algorithm tunes the model parameters by:

w_new = w_old − η · ∂E/∂w

wherein w_new represents the new weight, w_old represents the weight in the previous iteration, η represents the learning rate, and ∂E/∂w represents the back-propagated error gradient.
7. The determination method according to claim 1, wherein in step c, the plurality of vectorized candidate words are output by:

L = −[a · log y + (1 − a) · log(1 − y)]

wherein y is the predicted output value, a is the desired output, and L is the cross-entropy loss function value.
8. The method of claim 1, wherein the database is a standard lexicon.
9. An apparatus for determining new words that determines new words based on a deep neural network by using the determination method according to any one of claims 1 to 8, comprising:
first treatment device (1): generating a plurality of original candidate words based on an N-Gram algorithm and a text to be identified;
first determination means (2): training the original candidate words based on a BERT model, and determining vectorized candidate words;
second treatment device (3): outputting a plurality of vectorized candidate words as labeled y based on a deep neural network1,y2A neuron of { overspread };
second determination means (4): and matching one or more original candidate words determined as words in a database, and if the original candidate words do not exist in the database, determining one or more original candidate words as new words.
10. The determination apparatus according to claim 9, wherein the first processing device (1) comprises:
third processing device (11): performing a sliding-window operation of size N on the text to be identified to form character strings of length N;
third determining device (12): determining all character strings of length N as original candidate words.
CN202010696059.4A 2020-07-20 2020-07-20 Method and device for determining new words Active CN111563143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010696059.4A CN111563143B (en) 2020-07-20 2020-07-20 Method and device for determining new words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010696059.4A CN111563143B (en) 2020-07-20 2020-07-20 Method and device for determining new words

Publications (2)

Publication Number Publication Date
CN111563143A CN111563143A (en) 2020-08-21
CN111563143B true CN111563143B (en) 2020-11-03

Family

ID=72073933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010696059.4A Active CN111563143B (en) 2020-07-20 2020-07-20 Method and device for determining new words

Country Status (1)

Country Link
CN (1) CN111563143B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950265A (en) * 2020-08-25 2020-11-17 中国电子科技集团公司信息科学研究院 Domain lexicon construction method and device
CN112434512A (en) * 2020-09-17 2021-03-02 上海二三四五网络科技有限公司 New word determining method and device in combination with context
CN112364628B (en) * 2020-11-20 2022-04-15 创优数字科技(广东)有限公司 New word recognition method and device, electronic equipment and storage medium
CN112463969B (en) * 2020-12-08 2022-09-20 上海烟草集团有限责任公司 Method, system, equipment and medium for detecting new words of cigarette brand and product rule words
CN112883721B (en) * 2021-01-14 2024-01-19 科技日报社 New word recognition method and device based on BERT pre-training model
CN114492402A (en) * 2021-12-28 2022-05-13 北京航天智造科技发展有限公司 Scientific and technological new word recognition method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445915A (en) * 2016-09-14 2017-02-22 科大讯飞股份有限公司 New word discovery method and device
CN109710770A (en) * 2019-01-31 2019-05-03 北京牡丹电子集团有限责任公司数字电视技术中心 A kind of file classification method and device based on transfer learning
CN110580287A (en) * 2019-08-20 2019-12-17 北京亚鸿世纪科技发展有限公司 Emotion classification method based ON transfer learning and ON-LSTM

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101706807B (en) * 2009-11-27 2011-06-01 清华大学 Method for automatically acquiring new words from Chinese webpages
CN101930561A (en) * 2010-05-21 2010-12-29 电子科技大学 N-Gram participle model-based reverse neural network junk mail filter device
CN105512109B (en) * 2015-12-11 2019-04-16 北京锐安科技有限公司 The discovery method and device of new term
CN105930318B (en) * 2016-04-11 2018-10-19 深圳大学 A kind of term vector training method and system
CN107908618A (en) * 2017-11-01 2018-04-13 中国银行股份有限公司 A kind of hot spot word finds method and apparatus
US10540573B1 (en) * 2018-12-06 2020-01-21 Fmr Llc Story cycle time anomaly prediction and root cause identification in an agile development environment
CN110334210A (en) * 2019-05-30 2019-10-15 哈尔滨理工大学 A kind of Chinese sentiment analysis method merged based on BERT with LSTM, CNN
CN110287494A (en) * 2019-07-01 2019-09-27 济南浪潮高新科技投资发展有限公司 A method of the short text Similarity matching based on deep learning BERT algorithm
CN111401066B (en) * 2020-03-12 2022-04-12 腾讯科技(深圳)有限公司 Artificial intelligence-based word classification model training method, word processing method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445915A (en) * 2016-09-14 2017-02-22 科大讯飞股份有限公司 New word discovery method and device
CN109710770A (en) * 2019-01-31 2019-05-03 北京牡丹电子集团有限责任公司数字电视技术中心 A kind of file classification method and device based on transfer learning
CN110580287A (en) * 2019-08-20 2019-12-17 北京亚鸿世纪科技发展有限公司 Emotion classification method based ON transfer learning and ON-LSTM

Also Published As

Publication number Publication date
CN111563143A (en) 2020-08-21

Similar Documents

Publication Publication Date Title
CN111563143B (en) Method and device for determining new words
CN108984724B (en) Method for improving emotion classification accuracy of specific attributes by using high-dimensional representation
CN109858041B (en) Named entity recognition method combining semi-supervised learning with user-defined dictionary
CN112199956B (en) Entity emotion analysis method based on deep representation learning
CN109753566A (en) The model training method of cross-cutting sentiment analysis based on convolutional neural networks
Zhang et al. Sentiment Classification Based on Piecewise Pooling Convolutional Neural Network.
CN106844349B (en) Comment spam recognition methods based on coorinated training
CN106919673A (en) Text mood analysis system based on deep learning
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN112131352A (en) Method and system for detecting bad information of webpage text type
CN111984791B (en) Attention mechanism-based long text classification method
CN113705238B (en) Method and system for analyzing aspect level emotion based on BERT and aspect feature positioning model
Grzegorczyk Vector representations of text data in deep learning
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
Tao et al. News text classification based on an improved convolutional neural network
US11822887B2 (en) Robust name matching with regularized embeddings
CN112434512A (en) New word determining method and device in combination with context
Nguyen et al. A self-attention network based node embedding model
Weijie et al. Long text classification based on BERT
CN116842934A (en) Multi-document fusion deep learning title generation method based on continuous learning
US20230376828A1 (en) Systems and methods for product retrieval
CN115730232A (en) Topic-correlation-based heterogeneous graph neural network cross-language text classification method
CN114595324A (en) Method, device, terminal and non-transitory storage medium for power grid service data domain division

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant