CN111563143B - Method and device for determining new words - Google Patents

Method and device for determining new words Download PDF

Info

Publication number
CN111563143B
CN111563143B (application CN202010696059.4A)
Authority
CN
China
Prior art keywords
word
words
determining
candidate words
original candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010696059.4A
Other languages
Chinese (zh)
Other versions
CN111563143A (en)
Inventor
刘凡平
沈振雷
陈慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai 2345 Network Technology Co ltd
Original Assignee
Shanghai 2345 Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai 2345 Network Technology Co ltd filed Critical Shanghai 2345 Network Technology Co ltd
Priority to CN202010696059.4A priority Critical patent/CN111563143B/en
Publication of CN111563143A publication Critical patent/CN111563143A/en
Application granted granted Critical
Publication of CN111563143B publication Critical patent/CN111563143B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for determining new words based on a deep neural network, comprising the following steps: a: generating a plurality of original candidate words based on an N-Gram algorithm and a text to be identified; b: training the original candidate words based on a BERT model and determining vectorized candidate words; c: outputting, based on a deep neural network, each vectorized candidate word as a neuron labeled {y1, y2}; d: matching the one or more original candidate words determined to be words against a database, and if an original candidate word does not exist in the database, determining it to be a new word. Based on a search target and range, the invention can determine at scale, through fully computerized big-data operation, the new words appearing in present-day society, and expand the word bank of the input method.

Description

Method and device for determining new words
Technical Field
The invention belongs to the field of computer technology application, and particularly relates to a method and a device for determining new words.
Background
With the continuous progress of society and the penetration of the internet into people's daily lives, communication between people is no longer limited to face-to-face exchange; much of it now takes place, often more effectively, over the network. In such a diversified and fast-paced modern society, events large and small occur at every moment, and new words are a by-product of this modern development that make communication more effective and interesting. Examples appearing in recent years include 'jiongattitude', 'kaoyou', 'pigeon' and 'old driver'; as such words become widely used in communication, the meanings and scenarios they describe gradually gain broad acceptance.
Third-party platforms and systems, however, often need to adapt to users' habits and interests in order to provide high-quality services. With the rapid development of the internet, new words emerge endlessly, and the inability to recognize some of them frequently causes trouble for these platforms and their users. How to keep pace with the new words of modern society has therefore become a technical problem that merchants must solve, and how to obtain recently coined words in large quantity and with high accuracy is the most pressing technical problem at present.
New-word discovery is generally considered from the perspectives of degree of freedom and degree of solidification: a candidate should appear in a relatively rich set of contexts, while its interior should be relatively stable, i.e., its internal solidification should be comparatively high. At present there is no technical scheme that solves these problems; specifically, there is no method and device for determining new words.
Disclosure of Invention
In view of the technical defects in the prior art, an object of the present invention is to provide a method and an apparatus for determining new words. According to one aspect of the present invention, a method for determining new words based on a deep neural network is provided, comprising the following steps:
a: generating a plurality of original candidate words based on an N-Gram algorithm and a text to be identified;
b: training the original candidate words based on a BERT model, and determining vectorized candidate words;
c: outputting, based on a deep neural network, each of the plurality of vectorized candidate words as a neuron labeled {y1, y2}, wherein,
when y1 is 1 and y2 is 0, the original candidate word corresponding to the vectorized candidate word is determined to be a word, and when y1 is 0 and y2 is 1, the original candidate word corresponding to the vectorized candidate word is determined not to be a word;
d: matching the one or more original candidate words determined to be words against a database, and if an original candidate word does not exist in the database, determining it to be a new word.
Preferably, in step a, the text content is determined as the text to be identified by one of the following:
-a byte stream;
-a character stream; or
-a word stream.
Preferably, in step a, the original candidate words generated based on the N-Gram algorithm are determined by:
a1: performing a sliding-window operation of size N on the text to be identified to form character strings of length N, wherein each such character string is called a gram, 1 < N < M, and M is the number of characters of the original candidate word;
a2: determining all character strings of length N as original candidate words.
Preferably, in step b, the BERT model is determined from a large amount of text and based on the characters, the semantic information of the characters, and the position information of the characters.
Preferably, in step b, each vectorized candidate word is a 768-dimensional vector.
Preferably, in step c, the deep neural network model is determined by: training the deep neural network model on words corresponding to positive-example feature vectors and non-words corresponding to negative-example feature vectors in equal proportion, and tuning the model parameters through a back-propagation algorithm so that the deep neural network model acquires the ability to discriminate words,
wherein a positive-example feature vector corresponds to a neuron labeled {1, 0} and a negative-example feature vector corresponds to a neuron labeled {0, 1}.
Preferably, the back-propagation algorithm tunes the model parameters by:

w_new = w_old − η · ∂E/∂w

wherein w_new represents the new weight, w_old represents the weight in the previous iteration, η represents the learning rate, and ∂E/∂w represents the back-propagated error gradient with respect to the weight.
Preferably, in step c, the plurality of vectorized candidate words are output by:

L = −[a · log ŷ + (1 − a) · log(1 − ŷ)]

wherein ŷ is the predicted output value, a is the desired output, and L is the cross-entropy loss function value.
Preferably, the database is a standard lexicon.
According to another aspect of the present invention, there is provided an apparatus for determining new words based on a deep neural network using the above determination method, comprising:
a first processing device: generating a plurality of original candidate words based on an N-Gram algorithm and a text to be identified;
a first determining device: training the plurality of original candidate words based on a BERT model and determining a plurality of vectorized candidate words;
a second processing device: outputting, based on a deep neural network, each of the plurality of vectorized candidate words as a neuron labeled {y1, y2}; and
a second determining device: matching the one or more original candidate words determined to be words against a database, and if an original candidate word does not exist in the database, determining it to be a new word.
Preferably, the first processing device comprises:
a third processing device: performing a sliding-window operation of size N on the text to be identified to form character strings of length N;
a third determining device: determining all character strings of length N as original candidate words.
The invention discloses a method for determining new words, which is used for determining new words based on a deep neural network and comprises the following steps: a: generating a plurality of original candidate words based on an N-Gram algorithm and a text to be identified; b: training the original candidate words based on a BERT model, and determining vectorized candidate words; c: outputting a plurality of vectorized candidate words as labeled y based on a deep neural network1,y2H, wherein when y is1Is 1, y2When the word is 0, determining the original candidate word corresponding to the vectorization candidate word as a word, and when y is1Is 0, y2When the word number is 1, determining that the original candidate word corresponding to the vectorization candidate word is not a word; d: and matching one or more original candidate words determined as words in a database, and if the original candidate words do not exist in the database, determining one or more original candidate words as new words. The method combines an N-Gram algorithm and a BERT model to determine and vectorize words in a text, purposefully adopts a creative deep neural network to output neurons based on a judgment standard, finally matches candidate words determined as the words with all the words in a database, if no such words exist, the candidate words are new words, the method can greatly determine the new words appearing in the current society based on a search target and a search range through intelligent operation of big data of a computer in the whole process, and expands an input method word bank.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
fig. 1 is a schematic flow chart illustrating a method for determining new words according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a specific process of generating a plurality of original candidate words based on an N-Gram algorithm and a text to be identified according to a first embodiment of the present invention;
FIG. 3 is a block diagram of a device for determining new words according to another embodiment of the present invention; and
fig. 4 is a diagram illustrating a deep neural network-based neuron determination according to a second embodiment of the present invention.
Detailed Description
In order to present the technical scheme of the invention better and more clearly, the invention is further described below with reference to the accompanying drawings.
Fig. 1 shows a flowchart of a method for determining new words according to an embodiment of the present invention. With reference to fig. 1 and fig. 2, the technical implementation of the method is described in detail below. The invention combines an N-Gram algorithm and a BERT model to determine and vectorize the words in a text, and determines new words based on a deep neural network. Specifically, the method comprises the following steps:
first, step S101 is performed to generate a plurality of original candidate words based on the N-Gram algorithm and the text to be identified, and those skilled in the art understand that N-Gram (sometimes also referred to as N-Gram) is a very important concept in natural language processing, and in NLP, one can predict or evaluate whether a sentence is reasonable or not by using N-Gram based on a certain corpus. On the other hand, another role of the N-Gram is to evaluate the degree of difference between two strings. This is a commonly used approach in fuzzy matching. In the present invention, the N-Gram algorithm is a control method for effectively segmenting text content to obtain a plurality of required data, which is a currently common prior art, and a text to be identified is used for segmented text content, and the original candidate word is required data that needs to be determined whether to be a word or not.
Further, in step S101, the text content may be determined as the text to be identified by means of a byte stream, where a byte stream is a stream whose basic unit of transmission is the byte; in most cases the byte is the smallest basic unit of data. Correspondingly, the text content may also be determined as the text to be identified by means of a character stream, which processes two-byte Unicode characters: a character stream operates on characters, character arrays and strings, whereas a byte stream operates on bytes and byte arrays. A character stream arises, for example, when the Java virtual machine converts bytes into Unicode characters. A byte stream can carry any type of object, including binary objects, while a character stream can only handle characters and strings; the byte stream thus provides the ability to handle any type of IO operation, but it cannot directly process Unicode characters the way a character stream can.
In yet another very specific embodiment, the text content may also be determined as the text to be identified by means of a word stream, i.e., a stream consisting of several words, in order to find long new words composed of existing words. For example, 'Beijing' is a word and 'university' is a word; their combination 'Beijing university' may itself be a new word. Given the word stream 'Beijing university student recruitment notification', and supposing 'Beijing university' is a not-yet-discovered new word, such a combination can be found through the word stream. The idea of the N-Gram is then to judge candidates such as: 'Beijing university', 'university student recruitment', 'student recruitment notification', 'Beijing university student recruitment', 'university student recruitment notification', and 'Beijing university student recruitment notification'.
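As an illustrative sketch only (the patent prescribes no implementation; the function and variable names below are assumptions), word-level n-grams over such a word stream can be enumerated as follows:

def word_ngrams(words, n_min=2, n_max=None):
    # Enumerate multi-word candidates from a word stream by joining every
    # run of n adjacent words (Chinese words concatenate without spaces).
    n_max = n_max or len(words)
    for n in range(n_min, n_max + 1):          # candidate length in words
        for i in range(len(words) - n + 1):    # sliding window of size n
            yield "".join(words[i:i + n])

stream = ["Beijing", "university", "student recruitment", "notification"]
print(list(word_ngrams(stream)))  # the six candidates enumerated above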
Then step S102 is performed: the plurality of original candidate words are trained on a BERT model, and a plurality of vectorized candidate words are determined. BERT (Bidirectional Encoder Representations from Transformers) is a general pre-trained language representation model with very good performance. Current methods for obtaining a pre-trained representation model are mainly feature-based or fine-tuning approaches. The word-vector generation process adopts the BERT model, whose inputs are token embeddings, segment embeddings and position embeddings. The present invention mainly applies the feature-based method: the input characters are converted into vectors which are fed, in the forward and reverse directions respectively, into recurrent (LSTM) layers to obtain the corresponding outputs; L such layers are stacked, and the feature outputs of the two directions are linearly combined to obtain the pre-trained representation model.
Further, in step S102, the BERT model is determined from a large amount of text, based on the characters, their semantic information and their position information. This embodiment discloses one way of establishing the BERT model: since building a BERT model is itself prior art, the targeted contribution here is determining it from a large amount of text based on the characters, their semantics and their positions, such that similar words yield vectors of high similarity.
Those skilled in the art will appreciate that the BERT model employs the Transformer and, when processing a character, can also take the characters before and after it into account to derive its meaning in context. From new texts collected on the internet every day, the characters, their sentences and their position information are passed as input to the BERT model, which is iterated until convergence is stable; the trained BERT model is the character-embedding model. The BERT model randomly selects 15% of the tokens in the corpus; of these, 80% are replaced by the [MASK] token, 10% are randomly changed into another token, and the remaining 10% are kept unchanged. The model is then required to predict the selected tokens correctly, thereby achieving semantic understanding of the characters.
Further, in step S102, each vectorized candidate word is a 768-dimensional vector: in such an embodiment the N-Gram algorithm is applied first, and the BERT model is then used to generate the vectorized candidate words, each of 768 dimensions.
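As a minimal sketch of this step (the patent names no specific library; the use of the Hugging Face transformers package and the bert-base-chinese checkpoint below is an assumption), a candidate can be mapped to a 768-dimensional vector as follows:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
model = BertModel.from_pretrained("bert-base-chinese")          # hidden size 768
model.eval()

def vectorize(candidate: str) -> torch.Tensor:
    # Return one 768-dimensional vector per candidate word.
    inputs = tokenizer(candidate, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Take the [CLS] position of the last hidden layer as the candidate vector.
    return outputs.last_hidden_state[0, 0]  # shape: (768,)

print(vectorize("learning").shape)  # torch.Size([768])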
Then step S103 is entered: based on the deep neural network, each of the plurality of vectorized candidate words is output as a neuron labeled {y1, y2}. When y1 is 1 and y2 is 0, the original candidate word corresponding to the vectorized candidate word is determined to be a word; when y1 is 0 and y2 is 1, it is determined not to be a word. The vectorized representation of a candidate word is labeled by the deep neural network in One-Hot form. Combining steps S101 and S102, in a preferred embodiment the text to be identified is 'good learning'; the N-Gram algorithm determines one original candidate word 'learning', the BERT model generates its 768-dimensional vectorized candidate word, and the deep neural network outputs the vectorized candidate word as a neuron labeled {y1, y2}. Since the original candidate word is 'learning', the neuron labeled {1, 0} is output, and 'learning' is determined to be a word.
Further, in step S103, the deep neural network model is determined by: training the deep neural network model on words corresponding to positive-example feature vectors and non-words corresponding to negative-example feature vectors in equal proportion, and tuning the model parameters through a back-propagation algorithm so that the model acquires the ability to discriminate words, wherein a positive-example feature vector corresponds to a neuron labeled {1, 0} and a negative-example feature vector corresponds to a neuron labeled {0, 1}.
Those skilled in the art understand that the input is divided into positive and negative examples: a positive example is the feature vector of a genuine word, such as the 768-dimensional vector of 'computer'; a negative example is the vector of a non-word, such as the 768-dimensional feature vector of 'I eat', which is not a word. The model is trained on positive and negative examples in equal proportion, i.e., each accounts for 50% of the data; positive examples are labeled [1, 0] and negative examples [0, 1], and the model parameters are tuned through a back-propagation algorithm so that the model acquires the ability to discriminate words.
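A minimal sketch of preparing such a balanced, one-hot-labeled training set (the sampling scheme and names are assumptions for illustration):

import random

def build_training_set(word_vectors, nonword_vectors):
    # Balance positive (word) and negative (non-word) examples 50/50 and
    # attach one-hot labels: [1, 0] = word, [0, 1] = not a word.
    n = min(len(word_vectors), len(nonword_vectors))
    positives = [(v, [1, 0]) for v in random.sample(word_vectors, n)]
    negatives = [(v, [0, 1]) for v in random.sample(nonword_vectors, n)]
    dataset = positives + negatives
    random.shuffle(dataset)
    return dataset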
Further, the back-propagation algorithm tunes the model parameters by:

w_new = w_old − η · ∂E/∂w

wherein w_new represents the new weight, w_old represents the weight in the previous iteration, η represents the learning rate, and ∂E/∂w represents the back-propagated error gradient.
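As a one-function sketch, the update rule above is ordinary gradient descent (the variable names are assumptions):

def update_weight(w_old: float, grad: float, learning_rate: float = 0.01) -> float:
    # w_new = w_old - learning_rate * grad, where grad is the
    # back-propagated error gradient with respect to this weight.
    return w_old - learning_rate * grad

print(update_weight(0.5, 0.2))  # 0.498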
Further, fig. 4 shows a schematic diagram of determining neurons based on a deep neural network according to a second embodiment of the present invention. With the embodiment of fig. 4, those skilled in the art understand that word identification is performed by supervised learning, treating it as a classification problem whose network is a fully connected deep neural network. The final purpose of the invention is to judge, from an input composed of several characters, the probability that it forms a genuine word. As shown in fig. 4, the input of the deep neural network model is the 768-dimensional vector of a candidate word (i.e., n = 768 input neurons); the output is 2 neurons, where [1, 0] denotes a word and [0, 1] denotes not a word; the loss function of the model is the cross-entropy loss; and the optimization method is Adam.
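A sketch of a classifier matching the stated architecture (768-dimensional input, 2 output neurons, cross-entropy loss, Adam); writing it in PyTorch, and the hidden-layer size, are assumptions, since the patent specifies neither:

import torch
import torch.nn as nn

class WordClassifier(nn.Module):
    # Fully connected network: 768-dim candidate vector -> 2 logits.
    def __init__(self, hidden: int = 256):    # hidden size assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(768, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),              # class 0 = word, class 1 = not a word
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = WordClassifier()
criterion = nn.CrossEntropyLoss()              # cross-entropy loss, as stated
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(vectors: torch.Tensor, labels: torch.Tensor) -> float:
    # One supervised step: forward pass, back-propagate, adjust parameters.
    optimizer.zero_grad()
    loss = criterion(model(vectors), labels)   # labels are class indices
    loss.backward()
    optimizer.step()
    return loss.item()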
Further, in step S103, the plurality of vectorized candidate words are output by:

L = −[a · log ŷ + (1 − a) · log(1 − ŷ)]

wherein ŷ is the predicted output value, a is the desired output, and L is the cross-entropy loss function value. In a two-class problem model such as logistic regression or a neural network, the label of a real sample is 0 or 1, denoting the negative and positive classes respectively. The model ultimately outputs a probability value, usually via the Sigmoid function, reflecting the probability of predicting the positive class: the larger, the more likely. The Sigmoid function is:

g(s) = 1 / (1 + e^(−s))

where s is the output of the layer above, and it has the following characteristics: g(0) = 0.5; g(s) ≈ 1 when s ≫ 0; and g(s) ≈ 0 when s ≪ 0. Evidently g(s) maps the linear output of the preceding layer to a probability in [0, 1], and this g(s) is the model prediction output ŷ in the cross-entropy formula. The prediction output, i.e., the output of the Sigmoid function, characterizes the probability that the current sample's label is 1: ŷ = P(y = 1 | x); clearly the probability that the label is 0 is then 1 − ŷ = P(y = 0 | x). Integrating the two cases from the perspective of maximum likelihood, P(y | x) = ŷ^y · (1 − ŷ)^(1−y): when the true label y = 0 the first factor is 1 and the expression reduces to P(y = 0 | x) = 1 − ŷ, and when y = 1 the second factor is 1 and it reduces to P(y = 1 | x) = ŷ. We want the probability P(y | x) to be as large as possible. Introducing the log function, which does not affect monotonicity, the larger log P(y | x) the better, or equivalently the smaller −log P(y | x) the better; letting Loss = −log P(y | x), the loss function is obtained as:

L = −[y · log ŷ + (1 − y) · log(1 − ŷ)]

which is the formula above with the desired output a written as the true label y. Those skilled in the art will appreciate that the cross-entropy loss is very sensitive to differences in classification quality and quantizes them accurately.
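A small numeric sketch of the Sigmoid and cross-entropy computations above (plain Python; the names are assumptions):

import math

def sigmoid(s: float) -> float:
    # g(s) = 1 / (1 + e^(-s)): maps a logit to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-s))

def cross_entropy(y: int, y_hat: float) -> float:
    # L = -[y*log(y_hat) + (1 - y)*log(1 - y_hat)] for a single sample.
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

y_hat = sigmoid(2.0)            # ~0.881: the model is fairly confident
print(cross_entropy(1, y_hat))  # ~0.127: small loss when the label is 1
print(cross_entropy(0, y_hat))  # ~2.127: large loss when the label is 0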
Finally, step S104 is performed: the one or more original candidate words determined to be words are matched against a database; if an original candidate word is not in the database, it is determined to be a new word. Those skilled in the art understand that the database is a standard lexicon. Combining steps S101 to S104: first, the system prepares the standard lexicon and retrieves a large number of articles; original candidate words are generated per step S101, in which a plurality of original candidate words are produced with the N-Gram; the probability that a candidate word is a word is then judged with the deep neural network. For example, for the original candidate word 'good learning', a 768-dimensional vectorized candidate word is generated with the BERT model per step S102; the deep neural network then judges it per step S103, producing one of the two vectors [1, 0] and [0, 1], where [1, 0] denotes a word and [0, 1] denotes not a word. Further, in step S104, the word is looked up in the standard lexicon: if it exists there, no processing is performed; if it does not, it is marked as a new word and added to the standard lexicon.
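A sketch of this final matching step (representing the standard lexicon as an in-memory set is an assumption):

def find_new_words(words_judged_by_network, lexicon: set) -> list:
    # Keep candidates the network judged to be words but that are
    # absent from the standard lexicon, then add them to it (step S104).
    new_words = [w for w in words_judged_by_network if w not in lexicon]
    lexicon.update(new_words)
    return new_words

lexicon = {"learning", "computer"}
print(find_new_words(["learning", "good learning"], lexicon))  # ['good learning']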
In a preferred embodiment, after a new word is added to the standard lexicon, it is compared with the other words in the lexicon to determine the domain to which it belongs and to learn its meaning. For example, the new word 'good learning' is compared with the existing word 'learning' and the angle between their vectors is computed; the smaller the angle, the more nearly the same their domains and the more similar their senses, so the domain to which the new word belongs can be judged approximately.
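A sketch of the vector-angle comparison (cosine similarity over the 768-dimensional BERT vectors; the numpy usage is an assumption):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) between two candidate vectors: the closer to 1, the smaller
    # the angle, hence the nearer the domains and the more similar the senses.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. compare the new word "good learning" with the existing word "learning":
# sim = cosine_similarity(vectorize("good learning").numpy(),
#                         vectorize("learning").numpy())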
Fig. 2 shows the detailed flow of the first embodiment of the present invention, namely the specific flow of generating a plurality of original candidate words based on the N-Gram algorithm and the text to be identified; fig. 2 details step S101 of the present invention, in which the original candidate words generated based on the N-Gram algorithm are determined as follows:
First, step S1011 is entered: a sliding-window operation of size N is performed on the text to be identified to form character strings of length N, each of which is called a gram, where 1 < N < M and M is the number of characters of the original candidate word. Those skilled in the art understand that if the original candidate word has 8 characters, N may preferably take the values 2, 3, 4, 5, 6 and 7, so that candidate words are generated from bigrams, trigrams, 4-grams, 5-grams, 6-grams and 7-grams: an eight-character text thus yields seven bigrams, six trigrams, five 4-grams, four 5-grams, three 6-grams and two 7-grams, each adjacent substring being one candidate word. Further, each candidate word is judged by the word-identification procedure described above: if it is judged to be a word, existing words are filtered out, and a candidate that does not belong to the existing lexicon is discovered as a new word.
Finally, step S1012 is performed: all character strings of length N are determined as original candidate words; that is, in combination with step S1011, all candidate words generated from the bigrams, trigrams, 4-grams, 5-grams, 6-grams and 7-grams are determined as original candidate words.
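A minimal sketch of steps S1011 and S1012 together (the names are assumptions):

def char_ngrams(text: str) -> list:
    # Slide a window of every size N (1 < N < M, M = character count)
    # over the text and collect all length-N substrings as candidates.
    m = len(text)
    candidates = []
    for n in range(2, m):                   # 1 < N < M
        for i in range(len(text) - n + 1):  # window start positions
            candidates.append(text[i:i + n])
    return candidates

print(char_ngrams("abcdefgh"))  # all 2- to 7-grams of an 8-character text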
Those skilled in the art understand that the invention combines the N-Gram algorithm, the BERT model and the neural network to improve the discovery rate of new words. Specifically, the N-Gram improves the coverage of word discovery, BERT uncovers semantic relations and the connections between words, and the neural network, by virtue of its data-fitting classification ability, discovers new words efficiently. With the three combined, research data show about 40 new words discovered per day at an accuracy of 70-80%; neither any one of the three technologies alone nor any pairwise combination of them achieves the technical effect of the scheme recorded here. The three technical features are stepwise, ordered and closely connected, i.e., the invention realizes new-word discovery on the basis of their combination. Compared with the prior art, the method can determine at scale, through fully computerized big-data operation based on a search target and range, the new words appearing in present-day society, expand the word bank of the input method, and efficiently acquire a large number of new words with high probability.
More specifically, compared with the prior art, the combined N-Gram, BERT and neural-network scheme exploits the diversity and completeness with which the N-Gram algorithm extracts words from big data, so that new words are found and determined in big data without omission and more accurately.
In this application the BERT model incorporates the context of the vocabulary, understanding and locating the function of a word more comprehensively on the basis of the characters, their semantic information and their position information, and cooperates with the neural network to output the final new-word judgment: the vectorized candidate words are used as input for training the neural network, which outputs neurons labeled {y1, y2}. In such an embodiment, if the probability threshold for judging whether a candidate is a new word is set lower, more suspected new words are obtained but the overall accuracy may fall; if it is set higher, fewer suspected new words are obtained but the overall accuracy improves. More specifically, whatever the threshold, in a preferred embodiment the application can quickly find new words simply by feeding the latest web articles into the recorded technical scheme, without changing the model algorithm or the model parameters.
In combination with the above embodiments, the preparation of labeled data sets can be reduced. In the prior art, the data labeling required to preprocess new-word discovery often consumes substantial manpower and financial resources; research results show that roughly half of the labeled data can be dispensed with, reducing that consumption and improving the efficiency of new-word discovery.
Fig. 3 is a schematic block-connection diagram of an apparatus for determining new words according to another embodiment of the present invention. Those skilled in the art understand that the apparatus determines new words based on a deep neural network using the determination method above, and comprises the first processing device 1: generating a plurality of original candidate words based on the N-Gram algorithm and the text to be identified; for the working principle of the first processing device 1 refer to step S101, which is not repeated here.
Further, the apparatus further comprises the first determining device 2: training the plurality of original candidate words based on the BERT model and determining a plurality of vectorized candidate words; for the working principle of the first determining device 2 refer to step S102, not repeated here.
Further, the apparatus further comprises the second processing device 3: outputting, based on the deep neural network, each vectorized candidate word as a neuron labeled {y1, y2}; for the working principle of the second processing device 3 refer to step S103, not repeated here.
Further, the apparatus further comprises the second determining device 4: matching the one or more original candidate words determined to be words against a database and, if an original candidate word is not in the database, determining it to be a new word; for the working principle of the second determining device 4 refer to step S104, not repeated here.
Further, the first processing device 1 comprises the third processing device 11: performing a sliding-window operation of size N on the text to be identified to form character strings of length N; for the working principle of the third processing device 11 refer to step S1011, not repeated here.
Further, the first processing device 1 further comprises the third determining device 12: determining all character strings of length N as original candidate words; for the working principle of the third determining device 12 refer to step S1012, not repeated here.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims (10)

1. A method for determining a new word based on a deep neural network is characterized by comprising the following steps:
a: generating a plurality of original candidate words based on an N-Gram algorithm and a text to be identified;
b: determining a BERT model from a large amount of text and based on characters, the semantic information of the characters, and the position information of the characters, training the plurality of original candidate words by a feature-based method, and determining a plurality of vectorized candidate words;
c: outputting, based on a deep neural network, each of the plurality of vectorized candidate words as a neuron labeled {y1, y2}, wherein,
when y1 is 1 and y2 is 0, the original candidate word corresponding to the vectorized candidate word is determined to be a word, and when y1 is 0 and y2 is 1, the original candidate word corresponding to the vectorized candidate word is determined not to be a word;
d: matching the one or more original candidate words determined to be words against a database, and if an original candidate word does not exist in the database, determining it to be a new word.
2. The determination method according to claim 1, wherein in step a, text content is determined as the text to be identified by any one of the following:
-a byte stream;
-a character stream; or
-a word stream.
3. The determination method according to claim 1, wherein in step a, the original candidate words generated based on the N-Gram algorithm are determined by:
a1: performing a sliding-window operation of size N on the text to be identified to form character strings of length N, wherein each such character string is called a gram, 1 < N < M, and M is the number of characters of the original candidate word;
a2: determining all character strings of length N as original candidate words.
4. The method of claim 1, wherein in step b, the vectorized candidate word is a 768-dimensional vector.
5. The determination method according to claim 1, wherein in step c, the deep neural network model is determined by: training the deep neural network model on words corresponding to positive-example feature vectors and non-words corresponding to negative-example feature vectors in equal proportion, and tuning the model parameters through a back-propagation algorithm so that the deep neural network model acquires the ability to discriminate words,
wherein a positive-example feature vector corresponds to a neuron labeled {1, 0} and a negative-example feature vector corresponds to a neuron labeled {0, 1}.
6. The determination method according to claim 5, wherein the back-propagation algorithm tunes the model parameters by:

w_new = w_old − η · ∂E/∂w

wherein w_new represents the new weight, w_old represents the weight in the previous iteration, η represents the learning rate, and ∂E/∂w represents the back-propagated error gradient.
7. The determination method according to claim 1, wherein in step c, the plurality of vectorized candidate words are output by:

L = −[a · log y + (1 − a) · log(1 − y)]

wherein y is the predicted output value, a is the desired output, and L is the cross-entropy loss function value.
8. The method of claim 1, wherein the database is a standard lexicon.
9. An apparatus for determining new words that determines new words based on a deep neural network by using the determination method according to any one of claims 1 to 8, comprising:
first treatment device (1): generating a plurality of original candidate words based on an N-Gram algorithm and a text to be identified;
first determination means (2): training the original candidate words based on a BERT model, and determining vectorized candidate words;
second treatment device (3): outputting a plurality of vectorized candidate words as labeled y based on a deep neural network1,y2A neuron of { overspread };
second determination means (4): and matching one or more original candidate words determined as words in a database, and if the original candidate words do not exist in the database, determining one or more original candidate words as new words.
10. The determination apparatus according to claim 9, wherein the first processing device (1) comprises:
third processing device (11): performing a sliding-window operation of size N on the text to be identified to form character strings of length N;
third determining device (12): determining all character strings of length N as original candidate words.
CN202010696059.4A 2020-07-20 2020-07-20 Method and device for determining new words Active CN111563143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010696059.4A CN111563143B (en) 2020-07-20 2020-07-20 Method and device for determining new words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010696059.4A CN111563143B (en) 2020-07-20 2020-07-20 Method and device for determining new words

Publications (2)

Publication Number Publication Date
CN111563143A CN111563143A (en) 2020-08-21
CN111563143B true CN111563143B (en) 2020-11-03

Family

ID=72073933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010696059.4A Active CN111563143B (en) 2020-07-20 2020-07-20 Method and device for determining new words

Country Status (1)

Country Link
CN (1) CN111563143B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950265A (en) * 2020-08-25 2020-11-17 中国电子科技集团公司信息科学研究院 Domain lexicon construction method and device
CN112434512A (en) * 2020-09-17 2021-03-02 上海二三四五网络科技有限公司 New word determining method and device in combination with context
CN112364628B (en) * 2020-11-20 2022-04-15 创优数字科技(广东)有限公司 New word recognition method and device, electronic equipment and storage medium
CN112463969B (en) * 2020-12-08 2022-09-20 上海烟草集团有限责任公司 Method, system, equipment and medium for detecting new words of cigarette brand and product rule words
CN112883721B (en) * 2021-01-14 2024-01-19 科技日报社 New word recognition method and device based on BERT pre-training model
CN114492402A (en) * 2021-12-28 2022-05-13 北京航天智造科技发展有限公司 Scientific and technological new word recognition method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445915A (en) * 2016-09-14 2017-02-22 科大讯飞股份有限公司 New word discovery method and device
CN109710770A (en) * 2019-01-31 2019-05-03 北京牡丹电子集团有限责任公司数字电视技术中心 A kind of file classification method and device based on transfer learning
CN110580287A (en) * 2019-08-20 2019-12-17 北京亚鸿世纪科技发展有限公司 Emotion classification method based ON transfer learning and ON-LSTM

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101706807B (en) * 2009-11-27 2011-06-01 清华大学 Method for automatically acquiring new words from Chinese webpages
CN101930561A (en) * 2010-05-21 2010-12-29 电子科技大学 N-Gram participle model-based reverse neural network junk mail filter device
CN105512109B (en) * 2015-12-11 2019-04-16 北京锐安科技有限公司 The discovery method and device of new term
CN105930318B (en) * 2016-04-11 2018-10-19 深圳大学 A kind of term vector training method and system
CN107908618A (en) * 2017-11-01 2018-04-13 中国银行股份有限公司 A kind of hot spot word finds method and apparatus
US10540573B1 (en) * 2018-12-06 2020-01-21 Fmr Llc Story cycle time anomaly prediction and root cause identification in an agile development environment
CN110334210A (en) * 2019-05-30 2019-10-15 哈尔滨理工大学 A kind of Chinese sentiment analysis method merged based on BERT with LSTM, CNN
CN110287494A (en) * 2019-07-01 2019-09-27 济南浪潮高新科技投资发展有限公司 A method of the short text Similarity matching based on deep learning BERT algorithm
CN111401066B (en) * 2020-03-12 2022-04-12 腾讯科技(深圳)有限公司 Artificial intelligence-based word classification model training method, word processing method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445915A (en) * 2016-09-14 2017-02-22 科大讯飞股份有限公司 New word discovery method and device
CN109710770A (en) * 2019-01-31 2019-05-03 北京牡丹电子集团有限责任公司数字电视技术中心 A kind of file classification method and device based on transfer learning
CN110580287A (en) * 2019-08-20 2019-12-17 北京亚鸿世纪科技发展有限公司 Emotion classification method based ON transfer learning and ON-LSTM

Also Published As

Publication number Publication date
CN111563143A (en) 2020-08-21

Similar Documents

Publication Publication Date Title
CN111563143B (en) Method and device for determining new words
CN108984724B (en) Method for improving emotion classification accuracy of specific attributes by using high-dimensional representation
CN109858041B (en) Named entity recognition method combining semi-supervised learning with user-defined dictionary
CN112199956B (en) Entity emotion analysis method based on deep representation learning
CN109753566A (en) The model training method of cross-cutting sentiment analysis based on convolutional neural networks
Zhang et al. Sentiment Classification Based on Piecewise Pooling Convolutional Neural Network.
CN106844349B (en) Comment spam recognition methods based on coorinated training
CN106919673A (en) Text mood analysis system based on deep learning
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN112131352A (en) Method and system for detecting bad information of webpage text type
CN111984791B (en) Attention mechanism-based long text classification method
CN113705238B (en) Method and system for analyzing aspect level emotion based on BERT and aspect feature positioning model
Grzegorczyk Vector representations of text data in deep learning
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
Tao et al. News text classification based on an improved convolutional neural network
US11822887B2 (en) Robust name matching with regularized embeddings
CN112434512A (en) New word determining method and device in combination with context
Nguyen et al. A self-attention network based node embedding model
Weijie et al. Long text classification based on BERT
CN116842934A (en) Multi-document fusion deep learning title generation method based on continuous learning
US20230376828A1 (en) Systems and methods for product retrieval
CN115730232A (en) Topic-correlation-based heterogeneous graph neural network cross-language text classification method
CN114595324A (en) Method, device, terminal and non-transitory storage medium for power grid service data domain division

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant