CN117150317A - New word determining method and device, terminal equipment and storage medium - Google Patents

New word determining method and device, terminal equipment and storage medium

Info

Publication number
CN117150317A
CN117150317A · Application CN202311200191.1A
Authority
CN
China
Prior art keywords
word
candidate
determining
corpus
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311200191.1A
Other languages
Chinese (zh)
Inventor
任宏杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Avatr Technology Chongqing Co Ltd
Original Assignee
Avatr Technology Chongqing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Avatr Technology Chongqing Co Ltd filed Critical Avatr Technology Chongqing Co Ltd
Priority to CN202311200191.1A priority Critical patent/CN117150317A/en
Publication of CN117150317A publication Critical patent/CN117150317A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of artificial intelligence and discloses a new word determining method and device, a terminal device, and a storage medium. The method comprises the following steps: obtaining a corpus text and segmenting it to obtain candidate words corresponding to the corpus text; determining a value score of each candidate word according to the word features and feature weight parameters of the candidate word; determining the similarity between the candidate word and a preset corpus; and determining whether the candidate word is a new word according to the value score and the similarity. Whereas the prior art determines whether a candidate word is a new word from word features alone, the method further considers the similarity between the candidate word and the existing new words in a preset corpus: the value score of the candidate word is determined according to the word features and feature weight parameters, and whether the candidate word is a new word is comprehensively determined according to both the value score and the similarity between the candidate word and the preset corpus. The method can therefore determine new words more accurately.

Description

New word determining method and device, terminal equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a new word determining method, a new word determining device, a terminal device, and a storage medium.
Background
Along with the rapid development of artificial intelligence, applications such as intelligent customer service, content pushing, and information matching are becoming more and more widespread, and timely discovery of new words in massive content has become one of the focuses of technical development. For example, for intelligent customer service, key information such as customers' points of interest, attention, doubt, and expectation is mined by means of text mining, so that the service flow can be improved according to this key information, sales communication scripts can be optimized, and the after-sales experience of users can be improved.
In the prior art, after word segmentation is performed on the corpus text to obtain candidate words, whether a candidate word is a new word is determined according to the information entropy or mutual information of the candidate word. However, the new words determined by this solution are not accurate.
Therefore, how to improve the accuracy of determining new words is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In order to solve the above problems, embodiments of the present application provide a new word determining method, apparatus, terminal device, and computer readable storage medium, which can improve accuracy of determining a new word.
In a first aspect, the present application provides a new word determining method. The method comprises the following steps:
Obtaining a corpus text, and segmenting the corpus text to obtain candidate words corresponding to the corpus text;
determining a value score of the candidate word according to the word characteristics and the characteristic weight parameters of the candidate word;
determining the similarity between the candidate words and a preset corpus;
and determining whether the candidate word is a new word according to the value score and the similarity.
In one embodiment, the determining the similarity between the candidate word and the preset corpus includes:
converting the candidate words into word vectors;
and determining the similarity between the word vector and a preset corpus by using a classification model.
In one embodiment, the determining the value score of the candidate word according to the word feature and the feature weight parameter of the candidate word includes:
determining word characteristics of the candidate words; the word characteristics comprise information entropy and mutual information;
acquiring feature weight parameters of the word features;
and determining the value score of the candidate word according to the word characteristic and the characteristic weight parameter.
In one embodiment, the determining whether the candidate word is a new word according to the value score and the similarity includes:
Acquiring a first calculation weight parameter and a second calculation weight parameter which respectively correspond to the value score and the similarity;
determining a composite score corresponding to the candidate word according to the value score, the similarity, the first calculation weight parameter and the second calculation weight parameter;
and determining whether the candidate word is a new word according to the comprehensive score.
In one embodiment, the obtaining the corpus text and word segmentation of the corpus text to obtain the candidate words corresponding to the corpus text includes:
acquiring a corpus text;
and segmenting the corpus text by using the N-gram model to obtain candidate words corresponding to the corpus text.
In one embodiment, the feature weight parameter is a BM25 value or TF-IDF value of the candidate word.
In one embodiment, the method further comprises:
and adjusting the value range of the word characteristics to be within a preset value range.
In a second aspect, the application further provides a new word determining device. The device comprises:
the acquisition module is used for acquiring a corpus text, and segmenting the corpus text to obtain candidate words corresponding to the corpus text;
the first calculation module is used for determining a value score of the candidate word according to the word characteristics and the characteristic weight parameters of the candidate word;
the second calculation module is used for determining the similarity between the candidate words and a preset corpus;
and the determining module is used for determining whether the candidate word is a new word according to the value score and the similarity.
In a third aspect, the application further provides a terminal device. The terminal device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the steps of the method described above.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium stores a computer program which, when executed by a processor, implements the steps of the method as described above.
The application provides a new word determining method. Whereas the prior art determines whether a candidate word is a new word from word features alone, the method further considers the similarity between the candidate word and the existing new words in a preset corpus: the value score of the candidate word is determined according to the word features and feature weight parameters of the candidate word, and whether the candidate word is a new word is comprehensively determined according to both the value score and the similarity between the candidate word and the preset corpus. The method can therefore determine new words more accurately.
It can be appreciated that the new word determining device, the terminal device and the computer readable storage medium provided in the embodiments of the present application have the same beneficial effects as the new word determining method described above, and are not described herein again.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings described below show some embodiments of the present application, and other drawings can be derived from them by a person skilled in the art without inventive effort.
FIG. 1 is a flowchart of a new word determining method according to an embodiment of the present application;
FIG. 2 is a schematic process diagram of another new word determining method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a new word determining device according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in the present description and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise. "plurality" means "two or more".
The new word determining method provided by the embodiment of the application can be executed by the processor of the terminal equipment when the corresponding computer program is run.
Fig. 1 is a flowchart of a new word determining method according to an embodiment of the present application, and for convenience of explanation, only a portion related to the present embodiment is shown, where the method provided in the present embodiment includes the following steps:
S100: and obtaining a corpus text, and segmenting the corpus text to obtain candidate words corresponding to the corpus text.
The corpus text refers to the sentences, paragraphs, or articles in which it needs to be determined whether new words exist. In general, the corpus text is obtained according to the actual application scene; for example, a transcript obtained by converting a call recording may be used as the corpus text, and the text of an exchange in which customer service communicates with a customer may also be used as the corpus text.
Candidate words refer to the segmented words for which it needs to be determined whether they are new words. Specifically, after the corpus text is obtained, word segmentation is performed on it to obtain the corresponding candidate words. It can be understood that a segmented corpus text generally corresponds to at least one candidate word, and whether each candidate word is a new word is then determined separately.
S200: and determining the value score of the candidate word according to the word characteristics and the characteristic weight parameters of the candidate word.
Word features are parameters characterizing a candidate word; they may include information entropy, mutual information, and the like, and this embodiment does not limit their specific content. The feature weight parameter is a parameter characterizing the weight of a word feature.
The value score is a parameter obtained by weighted calculation over the word features and the feature weight parameters; it represents the characteristics of the candidate word as determined from its word features.
S300: and determining the similarity between the candidate words and a preset corpus.
The preset corpus is the database storing preset corpus entries that have been determined to be new words; a corpus is a collection of a large amount of written or spoken text used for language research. Determining the similarity between a candidate word and the preset corpus is a semantic-similarity measurement that determines how similar the word is to each preset entry in the preset corpus.
Specifically, determining the similarity between the candidate word and the preset corpus based on the preset corpus may be done by means such as Latent Semantic Analysis (LSA) or Normalized Google Distance (NGD); this embodiment does not limit the specific way of calculating the similarity.
S400: and determining whether the candidate word is a new word according to the value score and the similarity.
Specifically, after the value score and the similarity are determined, calculating a comprehensive score corresponding to the candidate word according to the value score and the similarity; comparing the comprehensive score with a preset scoring threshold value, and determining the candidate word as a new word if the comprehensive score is greater than or equal to the preset scoring threshold value; and if the comprehensive score is smaller than the preset scoring threshold value, determining that the candidate word is not a new word.
The embodiment of the application provides a new word determining method. Whereas the prior art determines whether a candidate word is a new word from word features alone, the method further considers the similarity between the candidate word and the existing new words in a preset corpus: the value score of the candidate word is determined according to the word features and feature weight parameters of the candidate word, and whether the candidate word is a new word is comprehensively determined according to both the value score and the similarity between the candidate word and the preset corpus. The method can therefore determine new words more accurately.
On the basis of the above embodiment, the present embodiment further describes and optimizes a technical solution, and specifically, in this embodiment, determining similarity between a candidate word and a preset corpus includes:
converting the candidate words into word vectors;
and determining the similarity between the word vector and a preset corpus by using the classification model.
A word vector is a representation of a word as a vector; that is, each candidate word in the corpus text is represented as a corresponding vector. It can be appreciated that candidate words need to be encoded into numeric variables before being input into a neural network model such as a classification model. There are two common encoding modes: one-hot encoding (One-Hot Representation) and distributed encoding (Distributed Representation).
In a specific embodiment, the BERT (Bidirectional Encoder Representations from Transformers) model may be used to convert each candidate word in the corpus text into a corresponding high-dimensional vector (word vector), abstracting the text into a numeric result that is valuable for data modeling. The BERT model realizes context-based text representation mainly through the Transformer structure; the embedding table obtained by BERT pre-training can also be used as a static vector representation, regarded as the common (averaged) representation of each word learned from a large amount of corpus during training. Token embeddings are the word vectors: each word in the corpus text is converted into a one-dimensional vector by looking it up in a word vector table, and this vector serves as the input of the BERT model. In particular, English words may be split into finer granularity, for example "playing" split into "play" and "##ing"; cutting words into finer-grained WordPiece units is a common way to handle out-of-vocabulary words. Assuming the corpus text is "I like dog", the input text is tokenized before entering the Token Embedding layer, with the token [CLS] inserted at the beginning of the text and the token [SEP] inserted at its end; [CLS] carries the feature used for a classification model and may be omitted for non-classification models, while [SEP] is the separator used to split two sentences in the input corpus. The BERT model needs a vocabulary of only 30522 tokens to process English text, and the Token Embedding layer converts each token into a 768-dimensional vector; for example, 5 tokens would be converted into a matrix of shape (6, 768), i.e., a tensor of shape (1, 6, 768).
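As an illustration of this step, the following is a minimal sketch of converting a candidate word into a word vector, assuming the Hugging Face transformers package and the bert-base-chinese checkpoint; the mean-pooling of token embeddings is an illustrative choice, not a detail prescribed by this application.

```python
# A minimal sketch: candidate word -> BERT word vector.
# Assumes the Hugging Face `transformers` package and the
# `bert-base-chinese` checkpoint; mean-pooling is an illustrative choice.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def word_vector(candidate: str) -> torch.Tensor:
    # The tokenizer inserts [CLS] at the start and [SEP] at the end.
    inputs = tokenizer(candidate, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the 768-dimensional token embeddings into one vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

vec = word_vector("深度学习")
print(vec.shape)  # torch.Size([768])
```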
In another embodiment, a Word2Vec model may be used to convert candidate words into word vectors. Word2Vec is a tool for generating word vectors; it is a lightweight neural network comprising only an input layer, a hidden layer, and an output layer, and according to the difference in input and output, the Word2Vec framework mainly includes the CBOW and Skip-gram models.
In the BERT model, surrounding-word information is encoded through the attention structure in the Transformer, and the embedding information in the embedding table itself does not make use of surrounding words; the embedding table obtained by training a Word2Vec model can further utilize the information of surrounding words, so the embedding table obtained from the Word2Vec model performs better.
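For comparison, a minimal Word2Vec sketch follows, assuming the gensim package; the toy sentences, vector size, and the Skip-gram choice (sg=1) are illustrative assumptions.

```python
# A minimal Word2Vec sketch, assuming the `gensim` package.
from gensim.models import Word2Vec

# Each sentence is a token list produced by the word segmentation step.
sentences = [["吃", "葡萄", "不", "吐", "葡萄", "皮"],
             ["不", "吃", "葡萄", "倒", "吐", "葡萄", "皮"]]

# sg=1 selects Skip-gram; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)
vec = model.wv["葡萄"]   # the static embedding of the candidate word
print(vec.shape)          # (100,)
```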
A classification model is a model obtained in advance by learning and training with training samples; unknown samples can then be classified with it. Classification models include decision trees, support vector machines, the XGBoost (eXtreme Gradient Boosting) model, the LightGBM model, and the like. A decision tree is a non-parametric supervised learning method that can summarize decision rules from a series of data with features and labels and present those rules in a tree structure to solve classification and regression problems. A support vector machine is a typical binary classification model. The XGBoost model is an ensemble classifier built with the boosting method, integrating a number of weak classifiers into one strong classifier. The LightGBM model is likewise a boosting-based ensemble classifier that iteratively trains weak classifiers (decision trees) to obtain an optimal model.
In one embodiment, word vectors are classified using a gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT) model to calculate the similarity between the word vector corresponding to a candidate word and the preset corpus. The gradient boosting decision tree algorithm is an additive model based on the boosting ensemble idea: it learns greedily with a forward stagewise algorithm during training, and in each iteration learns a CART tree that fits the residual between the prediction of the previous t-1 trees and the true value of the training sample.
The basic idea of the XGBoost model is the same as that of the gradient boosting decision tree model, but XGBoost uses second derivatives to make the loss function more accurate, regularization terms to avoid overfitting of the trees, and block storage that allows parallel computation. XGBoost is therefore efficient, flexible, and portable, and can improve the accuracy and convenience of determining the similarity between candidate words and the preset corpus.
In a specific implementation, the vectorized candidate words (i.e., the word vectors) are classified and modeled by the XGBoost model, the similarity between each candidate word and the preset corpus (the existing corpus entries) is calculated, and the candidate words with higher similarity to the existing corpus are identified. The method of this embodiment can thus determine the similarity between candidate words and the preset corpus efficiently and accurately, improving the accuracy and efficiency of new word determination.
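A hedged sketch of this classification step follows, assuming the xgboost and numpy packages; the random training data and the reading of the positive-class probability as the similarity Score(Classification) are illustrative assumptions, not the application's prescribed training procedure.

```python
# A sketch of scoring word-vector similarity with an XGBoost classifier.
import numpy as np
import xgboost as xgb

# X_train: word vectors; y_train: 1 if the word belongs to the preset
# corpus of known new words, 0 otherwise.  Random data stands in here.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 768))
y_train = rng.integers(0, 2, size=200)

clf = xgb.XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.1)
clf.fit(X_train, y_train)

def corpus_similarity(word_vec: np.ndarray) -> float:
    # Positive-class probability, used as Score(Classification).
    return float(clf.predict_proba(word_vec.reshape(1, -1))[0, 1])
```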
On the basis of the above embodiment, the present embodiment further describes and optimizes the technical solution, and specifically, in this embodiment, determining the value score of the candidate word according to the word feature and the feature weight parameter of the candidate word includes:
determining word characteristics of candidate words; word characteristics include information entropy and mutual information;
acquiring feature weight parameters of word features;
and determining the value score of the candidate word according to the word characteristics and the characteristic weight parameters.
In this embodiment, the word characteristics of the candidate words include information entropy and mutual information.
The information entropy comprises a left-neighbor entropy value and a right-neighbor entropy value, which represent the richness of the words adjacent to the candidate word on its left and right sides.

The left-neighbor entropy value EL is:

$$EL = -\sum_{w_i \in W_m} p(w_i \mid s)\,\log p(w_i \mid s)$$

and the right-neighbor entropy value ER is:

$$ER = -\sum_{w_i \in W_n} p(w_i \mid s)\,\log p(w_i \mid s)$$

where $W_m$ denotes the set of left-neighbor strings of the candidate word and $W_n$ the set of its right-neighbor strings; $w_i$ denotes the i-th string and s a string combination; and $p(w_i \mid s)$ denotes the probability that the i-th string $w_i$ occurs in the string combination s.

The information entropy $L_w(EL, ER)$ of the candidate word is then determined from the left-neighbor entropy value and the right-neighbor entropy value.
It should be noted that the information entropy reflects the average amount of information obtained on learning the outcome of an event. For example, if a result occurs with probability p, then when we learn that it has indeed occurred, the corresponding information content is defined as -log(p); the smaller p is, the larger the corresponding information content.
For example, suppose that of the six faces of a die, three show 1, two show 2, and one shows 3. The information content of rolling a 1 is -log(1/2) ≈ 0.693; that of rolling a 2 is -log(1/3) ≈ 1.0986; and that of rolling a 3 is -log(1/6) ≈ 1.79. The probability of obtaining 0.693 of information is 1/2, that of obtaining 1.0986 is 1/3, and that of obtaining 1.79 is 1/6; on average one therefore obtains 0.693/2 + 1.0986/3 + 1.79/6 ≈ 1.0114 of information, and 1.0114 is the information entropy of this die. If a die has 100 faces, 99 of which show 1 and only one of which shows 2, the information content when it comes up 2 is -log(1/100) ≈ 4.605; the probability of obtaining 4.605 of information is one percent, and in the other cases only -log(99/100) ≈ 0.01005 of information is obtained. On average only about 0.056 of information can therefore be obtained, and this is that die's entropy. Consider the most extreme case: if all six faces of a die show 1, throwing it brings no information at all, so its entropy is -log(1) = 0.
That is, the information entropy intuitively reflects the randomness of an event's outcome: the larger the left-neighbor and right-neighbor entropy values, the more varied the words adjacent to the candidate word, the greater the candidate word's freedom to stand as an independent word, and hence the greater the likelihood that it is a new word.
In practical application, the randomness of the left-neighbor and right-neighbor string sets of a candidate word can be measured with information entropy. Suppose the corpus text is "eat grapes without spitting out grape skins; not eating grapes, yet spitting out grape skins" (吃葡萄不吐葡萄皮，不吃葡萄倒吐葡萄皮), in which the word "grape" appears four times; its left-neighbor string set is {eat, spit, eat, spit} and its right-neighbor string set is {not, skin, yet, skin}. The information entropy of the left-neighbor string set of "grape" is therefore $-\frac{1}{2}\log\frac{1}{2} - \frac{1}{2}\log\frac{1}{2} \approx 0.693$, and that of the right-neighbor string set is $-\frac{1}{2}\log\frac{1}{2} - \frac{1}{4}\log\frac{1}{4} - \frac{1}{4}\log\frac{1}{4} \approx 1.04$; it can be seen that in this corpus text the right neighbors of the word "grape" are more varied.
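The neighbor-entropy computation for this grape example can be sketched as follows; the helper below uses only the standard library and reproduces the 0.693 and 1.04 values above.

```python
# Left/right neighbor entropy of a candidate word, per the EL/ER formulas.
import math
from collections import Counter

def neighbor_entropy(neighbors: list[str]) -> float:
    total = len(neighbors)
    counts = Counter(neighbors)
    # EL / ER = -sum over neighbor strings of p(w|s) * log p(w|s).
    return -sum((c / total) * math.log(c / total) for c in counts.values())

left = ["吃", "吐", "吃", "吐"]    # left neighbors of "葡萄" (grape)
right = ["不", "皮", "倒", "皮"]   # right neighbors of "葡萄"
print(round(neighbor_entropy(left), 3))   # 0.693
print(round(neighbor_entropy(right), 3))  # 1.04
```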
Mutual information reflects the degree of interdependence between two words. In general, the larger the mutual information value, the stronger the correlation between the two words, i.e., the more often they appear together and the higher their degree of cohesion, so the more likely they are to form a phrase. Conversely, the smaller the mutual information value, the weaker the correlation between the two words, the lower the probability that they appear together, the lower their cohesion, and the less likely they are to form a phrase.
Specifically, the mutual information is calculated as:

$$PMI(X, Y) = \log\frac{p(X, Y)}{p(X)\,p(Y)}$$

where X and Y denote two adjacent words, p(X, Y) denotes the probability that X and Y occur together, and p(X) and p(Y) denote the probabilities that X and Y occur separately. For example, assume that in a corpus text the word "deep learning" appears 10 times, the word "deep" appears 15 times, and the word "learning" appears 20 times. Since the total word count of the corpus is a constant, the point-wise mutual information of the word "deep learning" is

$$PMI = \log\frac{10/N}{(15/N)(20/N)} = \log\frac{10N}{300}$$

where N is the total word count of the corpus.
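A short sketch of this point-wise mutual information computation follows; the total token count N is an assumed corpus statistic.

```python
# PMI(X, Y) = log( p(X,Y) / (p(X) * p(Y)) ) from raw corpus counts.
import math

def pmi(count_xy: int, count_x: int, count_y: int, n_total: int) -> float:
    p_xy = count_xy / n_total
    p_x, p_y = count_x / n_total, count_y / n_total
    return math.log(p_xy / (p_x * p_y))

# "deep learning" appears 10 times, "deep" 15 times, "learning" 20 times.
print(pmi(10, 15, 20, n_total=100_000))  # log(100000 * 10 / 300) ≈ 8.11
```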
In a specific embodiment, the feature weight parameter is the BM25 value or the TF-IDF value of the candidate word.
The BM25 (Best Match 25) algorithm calculates the relevance between the corpus text and documents; it is an improvement based on TF-IDF. Specifically, the TF score of BM25 is:

$$TF_{score} = \frac{TF \cdot (k_1 + 1)}{TF + k_1 \left(1 - b + b \cdot \frac{dl}{avgdl}\right)}$$

where $k_1$ controls how quickly the word-frequency contribution saturates: the smaller $k_1$ is, the faster saturation is reached, and the larger $k_1$ is, the slower saturation changes; the default value is typically 1.2. b controls the degree of document-length normalization, i.e., the extent to which the document length is used to represent the amount of information: b = 0 disables normalization and b = 1 enables it fully, with a typical default of 0.75. avgdl is the average length of all documents, and dl is the length of the current document.
TF-IDF (Term Frequency-Inverse Document Frequency) measures how important a word is to one of the documents in a document set or corpus: the importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears in the whole corpus. Term Frequency (TF) refers to the frequency with which a given word appears in the document. For a candidate word $t_i$ in a particular document $d_j$, its importance can be expressed as:

$$TF_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where $n_{i,j}$ denotes the number of occurrences of candidate word $t_i$ in document $d_j$, $TF_{i,j}$ denotes the frequency of $t_i$ in $d_j$, and $\sum_k n_{k,j}$ denotes the sum of the occurrence counts of all words in document $d_j$.

Inverse Document Frequency (IDF) measures the general importance of a word. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient:

$$IDF_i = \log\frac{|D|}{1 + |\{j : t_i \in d_j\}|}$$

where |D| denotes the total number of documents in the corpus and $|\{j : t_i \in d_j\}|$ denotes the number of documents containing the candidate word $t_i$; to prevent a zero denominator when $t_i$ does not appear in the corpus, the denominator is set to $1 + |\{j : t_i \in d_j\}|$.

It can be appreciated that TF-IDF tends to filter out common words and retain important ones; it is calculated as:

$$TF\text{-}IDF = TF_{i,j} \cdot IDF_i$$
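The TF-IDF weight from the formulas above can be sketched as follows; the toy documents are illustrative, and the +1 in the IDF denominator guards against words absent from every document, as noted above.

```python
# TF-IDF = TF_{i,j} * IDF_i, computed over a toy document collection.
import math

def tf_idf(term: str, doc: list[str], docs: list[list[str]]) -> float:
    tf = doc.count(term) / len(doc)                  # TF_{i,j}
    n_containing = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / (1 + n_containing))   # IDF_i
    return tf * idf

docs = [["深度", "学习", "模型"], ["深度", "相机"],
        ["模型", "压缩"], ["语料", "文本"]]
print(round(tf_idf("学习", docs[0], docs), 3))  # (1/3) * log(4/2) ≈ 0.231
```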
In a specific embodiment, for each candidate word, the word features of the candidate word are calculated, including the mutual information, the left-neighbor entropy value, and the right-neighbor entropy value; the BM25 value of the candidate word is calculated and taken as the feature weight parameter; and the mutual information, left-neighbor entropy value, right-neighbor entropy value, and BM25 value (feature weight parameter) are substituted into the new word judgment function to calculate the value score of the candidate word.
The new word judgment function is:

$$Score(Value) = \omega \cdot Score(PMI) + \omega \cdot Score(W_m EL) + \omega \cdot Score(W_n ER)$$

where Score(Value) denotes the value score of the candidate word; Score(PMI) is the mutual information value of the candidate word; $Score(W_m EL)$ and $Score(W_n ER)$ are the left-neighbor and right-neighbor entropy values, respectively; $W_m$ and $W_n$ are the left-neighbor and right-neighbor string sets, respectively; and ω is the BM25 value of the candidate word, i.e., the feature weight parameter.
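Expressed as code, the judgment function reduces to the sketch below; the inputs are assumed to be the normalized features described later, and the uniform BM25 weight follows the formula above.

```python
# Score(Value) = w*Score(PMI) + w*Score(WmEL) + w*Score(WnER),
# with the candidate word's BM25 value as the weight w.
def value_score(norm_pmi: float, norm_el: float, norm_er: float,
                bm25: float) -> float:
    return bm25 * (norm_pmi + norm_el + norm_er)
```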
Therefore, this embodiment uses the BM25 value or the TF-IDF value of the candidate word as its feature weight parameter, which accurately describes the weight of the candidate word, so that the value score determined from the feature weight parameter is more accurate. The method of this embodiment considers both the influence of the candidate word's left and right neighboring words on its value score and the influence of the context of the document in which the candidate word appears, so the accuracy of determining whether a candidate word is a new word can be further improved.
In this embodiment, according to the new word judgment function, the value score of the candidate word is calculated from its BM25 or TF-IDF value combined with its information entropy and mutual information. The value score represents the comprehensive characteristics of the candidate word, so the candidate word's characteristics can be accurately represented and the accuracy of determining whether it is a new word, i.e., the accuracy of new word determination, is improved.
On the basis of the above embodiment, the present embodiment further describes and optimizes a technical solution, and specifically, in this embodiment, determining whether the candidate word is a new word according to the value score and the similarity includes:
acquiring a first calculation weight parameter and a second calculation weight parameter which correspond to the value score and the similarity respectively;
determining a comprehensive score corresponding to the candidate word according to the value score, the similarity, the first calculation weight parameter and the second calculation weight parameter;
and determining whether the candidate word is a new word according to the comprehensive score.
The first calculation weight parameter and the second calculation weight parameter are weights respectively corresponding to the value score and the similarity when the comprehensive score is calculated; the first calculation weight and the second calculation weight may be set according to actual requirements, which is not limited in this embodiment.
In one specific implementation, the first and second calculation weights may each be set to 0.5, i.e., the average of the value score and the similarity is taken as the composite score of the candidate word: $Score_{new}(n) = (Score(Value) + Score(Classification)) / 2$, where $Score_{new}(n)$ denotes the composite score of the candidate word, Score(Value) denotes its value score, and Score(Classification) denotes its similarity to the preset corpus.
Specifically, after determining a first calculation weight parameter and a second calculation weight parameter corresponding to the value score and the similarity respectively, carrying out weighted calculation according to the value score and the first calculation weight parameter, the similarity and the second calculation weight parameter to obtain a comprehensive score corresponding to the candidate word; comparing the comprehensive score with a preset scoring threshold value, and determining the candidate word as a new word if the comprehensive score is greater than or equal to the preset scoring threshold value; and if the comprehensive score is smaller than the preset scoring threshold value, determining that the candidate word is not a new word.
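A sketch of this decision step follows, assuming equal calculation weights of 0.5 and an illustrative scoring threshold; the threshold value is an assumption, to be tuned to the actual corpus.

```python
# Composite score and new-word decision.
def is_new_word(value_score: float, similarity: float,
                w1: float = 0.5, w2: float = 0.5,
                threshold: float = 0.6) -> bool:
    composite = w1 * value_score + w2 * similarity
    return composite >= threshold
```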
Therefore, according to the method of the embodiment, whether the candidate word is a new word can be determined more accurately and efficiently, namely, the efficiency and the accuracy of determining the new word are improved.
On the basis of the above embodiment, the present embodiment further describes and optimizes the technical solution, and specifically, in this embodiment, a corpus text is obtained, and word segmentation is performed on the corpus text to obtain candidate words corresponding to the corpus text, including:
acquiring a corpus text;
and segmenting the corpus text by using the N-gram model to obtain candidate words corresponding to the corpus text.
The N-gram model is an algorithm based on a statistical language model. Its basic idea is to slide a window of size N over the content of the corpus text, byte by byte, forming a sequence of byte fragments of length N; each byte fragment is called a gram, and each fragment is a candidate word.
In this embodiment, after obtaining a corpus text, word segmentation is performed on the corpus text by using an N-gram model, the corpus text is converted into a byte segment sequence, a dictionary of the N-gram is generated, and word frequencies of candidate words in the byte segment sequence are counted.
In this embodiment, the N-gram model is used to segment the corpus text to obtain candidate words corresponding to the corpus text, and when the N-gram model segments the word, the matching information between adjacent contexts can be used to improve the word segmentation effect.
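The sliding-window candidate generation can be sketched as follows; the window sizes are illustrative assumptions.

```python
# N-gram candidate generation with word-frequency counting.
from collections import Counter

def ngram_candidates(text: str, n_min: int = 2, n_max: int = 4) -> Counter:
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):   # slide a window of size n
            counts[text[i:i + n]] += 1
    return counts

freq = ngram_candidates("吃葡萄不吐葡萄皮不吃葡萄倒吐葡萄皮")
print(freq["葡萄"])   # 4
```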
On the basis of the above embodiment, the technical solution is further described and optimized in this embodiment, and specifically, in this embodiment, a new word determining method further includes:
adjusting the value range of the word features so that it falls within a preset value range.
Specifically, if the word feature is information entropy, determining that a value range corresponding to the information entropy is a first preset value range, where the first preset value range may be specifically (0, 1).
It should be noted that because the value range of information entropy is (0, +∞), i.e., the information entropy can be arbitrarily large, when refined new words are mined from massive corpus texts, a large number of candidate words may have extremely high left-neighbor and right-neighbor entropy values, so the situation in which the information entropy becomes unbounded needs to be avoided.
Specifically, information entropy is defined as:

$$H(X) = -\sum_{x \in X} p(x)\log p(x)$$

When the distribution of the random variable X is deterministic, the information entropy H(X) attains its minimum: $H(X) = -1 \cdot \log_2 1 - 0 \cdot \log_2 0 - \dots - 0 \cdot \log_2 0 = 0$.

The maximum of H(X) is found with a Lagrange multiplier. The objective function to maximize is:

$$\max H(X) = -\sum_{x \in X} p(x)\log p(x)$$

In standard Lagrangian form this becomes:

$$\min -H(X) = \sum_{x \in X} p(x)\log p(x)$$

subject to the constraint:

$$\sum_x p(x) = 1$$

The final Lagrangian function is therefore:

$$L(X, w) = \sum_i p(x_i)\log p(x_i) + w\left(\sum_i p(x_i) - 1\right)$$

Taking the partial derivative of this function with respect to each $p(x_i)$ and setting it to 0 gives:

$$\log p(x_i) + 1 + w = 0$$

This equation holds for all of $x_1, x_2, \dots, x_n$, so $p(x_1) = p(x_2) = \dots = p(x_n)$; combining it with the constraint $\sum_i p(x_i) = 1$ yields $p(x_1) = p(x_2) = \dots = p(x_n) = 1/n$. Substituting into H(X) gives the extreme value of the information entropy:

$$H(X) = -\sum_{i=1}^{n} p(x_i)\log p(x_i) = -\sum_{i=1}^{n} \frac{1}{n}\log\frac{1}{n} = \log n$$

The value range is thus $0 \le H(X) \le \log N$, where N may be unbounded; the information entropy therefore ranges from 0 to positive infinity.
The normalized information entropy

$$H_{norm}(X) = \frac{H(X)}{\log N}$$

is therefore introduced; through this normalization, the value range corresponding to the information entropy is determined to be (0, 1), where 0 means no uncertainty, i.e., no information gain, and 1 means maximum uncertainty, i.e., maximum information gain.
Specifically, if the word feature is mutual information, it is determined that the value range corresponding to the mutual information is a second preset value range, and the second preset value range may specifically be (0, 1).
It should be noted that the value range of mutual information is (-∞, +∞). The range of the point-wise mutual information (PMI) depends on the joint probability and the marginal probabilities of the two events, so consider the minimum and maximum values of PMI. When P(X, Y) = 0 (i.e., events X and Y never occur together), the value of PMI is negative infinity; in practice, with sparse data or large data sets, many event combinations may never be observed, making PMI values very extreme. The maximum of PMI is attained when the two events always occur together: in the extreme case where the occurrences of X and Y depend entirely on each other, i.e., P(X, Y) = P(X) = P(Y), then PMI = log(1/P(X)), so the maximum of PMI depends on the smallest single-event probability. As P(X) approaches 0, the value of PMI approaches positive infinity; hence PMI has the value range (-∞, +∞).
In this embodiment, the normalized mutual information

$$NMI(X, Y) = \frac{MI(X, Y)}{\sqrt{H(X)\,H(Y)}}$$

is introduced; through this normalization, the value range corresponding to the mutual information is determined to be (0, 1), where MI(X, Y) denotes the mutual information between X and Y, and H(X) and H(Y) are the entropies of X and Y, respectively. When MI(X, Y) = 0, NMI(X, Y) is also 0, meaning that X and Y are independent and the two words share no mutual information; when X and Y share complete mutual information, NMI(X, Y) = 1, meaning their mutual dependence is strongest.
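Both normalizations can be sketched briefly as follows; the geometric-mean denominator in the NMI form matches the formula above and is one common convention, assumed here.

```python
# Normalized entropy and normalized mutual information, both in (0, 1).
import math

def normalized_entropy(h: float, n_outcomes: int) -> float:
    return h / math.log(n_outcomes) if n_outcomes > 1 else 0.0

def normalized_mi(mi: float, h_x: float, h_y: float) -> float:
    denom = math.sqrt(h_x * h_y)
    return mi / denom if denom > 0 else 0.0
```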
In addition, because the value range of BM25 is (0, 1), the value score can be limited to between 0 and 1, which makes it convenient to accurately determine a suitable threshold point and hence to accurately determine the value score of the candidate word.
By adjusting the value range of the word features to within the preset value range, the application avoids the situation in which the information entropy or mutual information becomes unbounded, so the value score of the candidate word can be determined more conveniently and efficiently.
Fig. 2 is a schematic process diagram of a new word determining method according to an embodiment of the present application. In order to enable those skilled in the art to better understand the technical scheme of the present application, the technical scheme in the embodiment of the present application is described in detail below in conjunction with practical application scenarios. As shown in fig. 2, in the embodiment of the present application, a new word determining method includes the following specific steps:
Acquiring a corpus text;
word segmentation is carried out on the corpus text by utilizing the N-gram model, and candidate words corresponding to the corpus text are obtained;
respectively calculating the information entropy and the mutual information of each candidate word, and normalizing the value range of the information entropy and the value range of the mutual information into (0, 1); the information entropy comprises a left adjacent entropy value and a right adjacent entropy value;
the BM25 value corresponding to each candidate word is determined, and the BM25 value is determined as the characteristic weight parameter of the corresponding candidate word;
calculating the value score of the candidate word based on the new word judgment function formula according to the left-neighbor entropy value, the right-neighbor entropy value and the mutual information;
converting each candidate word into a word vector (high-dimensional vector) by using BERT to obtain a target corpus;
calculating the similarity between the word vector and a preset corpus (the existing corpus) by using the XGBoost model to obtain the similarity of the candidate words;
calculating the average value of the value score and the similarity of the candidate words to obtain the comprehensive score of the candidate words;
and determining whether the candidate word is a new word according to the comprehensive score of the candidate word.
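Tying the steps of fig. 2 together, an end-to-end sketch might look as follows; it reuses the functions sketched in the earlier embodiments, and left_right_entropy, pmi_of, bm25_of, and norm are hypothetical helpers standing in for the corresponding computations.

```python
# End-to-end sketch of the flow in fig. 2 (helper names are illustrative).
def find_new_words(corpus_text: str, threshold: float = 0.6) -> list[str]:
    new_words = []
    for word in ngram_candidates(corpus_text):
        el, er = left_right_entropy(word, corpus_text)  # hypothetical helper
        score_value = value_score(norm(pmi_of(word)), norm(el), norm(er),
                                  bm25_of(word))        # hypothetical helpers
        similarity = corpus_similarity(word_vector(word).numpy())
        if 0.5 * score_value + 0.5 * similarity >= threshold:
            new_words.append(word)
    return new_words
```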
The embodiment of the application provides a new word determining method. Whereas the prior art determines whether a candidate word is a new word from word features alone, the method further considers the similarity between the candidate word and the existing new words in a preset corpus: the value score of the candidate word is determined according to the word features and feature weight parameters of the candidate word, and whether the candidate word is a new word is comprehensively determined according to both the value score and the similarity between the candidate word and the preset corpus. The method can therefore determine new words more accurately.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not limit the implementation of the embodiments of the present application in any way.
Fig. 3 is a schematic structural diagram of a new word determining device according to an embodiment of the present application. As shown in fig. 3, the new word determining apparatus of this embodiment includes an acquisition module 310, a first calculation module 320, a second calculation module 330, and a determination module 340; wherein,
an obtaining module 310, configured to obtain a corpus text, and segment the corpus text to obtain candidate words corresponding to the corpus text;
a first calculation module 320, configured to determine a value score of the candidate word according to the word feature and the feature weight parameter of the candidate word;
a second calculation module 330, configured to determine a similarity between the candidate word and a preset corpus;
a determining module 340 is configured to determine whether the candidate word is a new word according to the value score and the similarity.
The new word determining device provided by the embodiment of the application has the same beneficial effects as the new word determining method.
In one embodiment, the second computing module 330 includes:
The conversion sub-module is used for converting the candidate words into word vectors;
and the first computing sub-module is used for determining the similarity between the word vector and a preset corpus by using the classification model.
In one embodiment, the first computing module 320 includes:
the feature determination submodule is used for determining word features of the candidate words; word characteristics include information entropy and mutual information;
the first parameter acquisition sub-module is used for acquiring characteristic weight parameters of word characteristics;
and the second computing sub-module is used for determining the value score of the candidate word according to the word characteristics and the characteristic weight parameters.
In one embodiment, the determining module 340 includes:
the second parameter acquisition sub-module is used for acquiring a first calculation weight parameter and a second calculation weight parameter which correspond to the value score and the similarity respectively;
the third calculation sub-module is used for determining a comprehensive score corresponding to the candidate word according to the value score, the similarity, the first calculation weight parameter and the second calculation weight parameter;
and the determining submodule is used for determining whether the candidate word is a new word according to the comprehensive score.
In one embodiment, the acquisition module 310 includes:
the corpus text obtaining submodule is used for obtaining corpus texts;
And the word segmentation sub-module is used for segmenting the corpus text by utilizing the N-gram model to obtain candidate words corresponding to the corpus text.
In one embodiment, the feature weight parameter is the BM25 value or TF-IDF value of the candidate word.
In one embodiment, the new word determining device further includes:
a range adjusting module, configured to adjust the value range of the word features to within a preset value range.
It should be noted that, because the content of information interaction and execution process between the above devices/units is based on the same concept as the method embodiment of the present application, specific functions and technical effects thereof may be referred to in the method embodiment section, and will not be described herein.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
Fig. 4 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 4, the terminal device 400 of this embodiment includes a memory 401, a processor 402, and a computer program 403 stored in the memory 401 and executable on the processor 402; the steps of the embodiments of the method for determining new words described above are implemented by the processor 402 when executing the computer program 403; or the processor 402, when executing the computer program 403, performs the functions of the modules/units in the above-described apparatus embodiments.
By way of example, computer program 403 may be partitioned into one or more modules/units that are stored in memory 401 and executed by processor 402 to implement the methods of embodiments of the present application. One or more of the modules/units may be a series of computer program instruction segments capable of performing a specific function, which instruction segments are used to describe the execution of the computer program 403 in the terminal device 400. For example, the computer program 403 may be divided into an acquisition module, a first calculation module, a second calculation module and a determination module, each module specifically functioning as follows:
the acquisition module is used for acquiring the corpus text, and segmenting the corpus text to obtain candidate words corresponding to the corpus text;
The first calculation module is used for determining the value score of the candidate word according to the word characteristics and the characteristic weight parameters of the candidate word;
the second calculation module is used for determining the similarity between the candidate words and a preset corpus;
and the determining module is used for determining whether the candidate word is a new word according to the value score and the similarity.
In application, the terminal device 400 may be a computing device such as a desktop computer, a notebook computer, a palm computer, and a cloud server. Terminal device 400 may include, but is not limited to, memory 401 and processor 402. It will be appreciated by those skilled in the art that fig. 4 is merely an example of a terminal device and is not meant to be limiting, and may include more or fewer components than shown, or may combine certain components, or different components, e.g., a terminal device may also include an input-output device, a network access device, a bus, etc.; the input and output equipment can comprise a camera, an audio acquisition/play device, a display screen and the like; the network access device may include a communication module for wireless communication with an external device.
In application, the processor may be a central processing unit (Central Processing Unit, CPU), or other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In application, the memory may be an internal storage unit of the terminal device, such as a hard disk or an internal memory of the terminal device. The memory may also be an external storage device of the terminal device, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card (Flash Card) provided on the terminal device; it may also include both an internal storage unit and an external storage device of the terminal device. The memory is used to store an operating system, application programs, a boot loader (Boot Loader), data, and other programs, such as the program code of the computer program. The memory may also be used to temporarily store data that has been output or is to be output.
The embodiments of the present application also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method embodiments.
The computer readable storage medium provided by the embodiments of the present application has the same beneficial effects as the above new word determining method.
The present application may be implemented in whole or in part by a computer program which, when executed by a processor, performs the steps of the method embodiments described above; the computer program may be embodied in a computer readable storage medium. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include at least: any entity or device capable of carrying the computer program code to a terminal device, a recording medium, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not described or illustrated in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative apparatus and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or as combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between apparatuses may be in electrical, mechanical, or other form.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A new word determining method, the method comprising:
obtaining a corpus text, and segmenting the corpus text to obtain candidate words corresponding to the corpus text;
determining a value score of the candidate word according to the word characteristics and the characteristic weight parameters of the candidate word;
determining the similarity between the candidate word and a preset corpus;
and determining whether the candidate word is a new word according to the value score and the similarity.
2. The method of claim 1, wherein the determining the similarity between the candidate word and a preset corpus comprises:
converting the candidate word into a word vector;
and determining the similarity between the word vector and the preset corpus by using a classification model.
3. The method of claim 1, wherein the determining a value score of the candidate word according to the word characteristics and the characteristic weight parameters of the candidate word comprises:
determining word characteristics of the candidate word, wherein the word characteristics comprise information entropy and mutual information;
acquiring characteristic weight parameters of the word characteristics;
and determining the value score of the candidate word according to the word characteristics and the characteristic weight parameters.
4. The method of claim 1, wherein the determining whether the candidate word is a new word according to the value score and the similarity comprises:
acquiring a first calculation weight parameter and a second calculation weight parameter which respectively correspond to the value score and the similarity;
determining a composite score corresponding to the candidate word according to the value score, the similarity, the first calculation weight parameter and the second calculation weight parameter;
and determining whether the candidate word is a new word according to the comprehensive score.
5. The method of claim 1, wherein the obtaining a corpus text and segmenting the corpus text to obtain candidate words corresponding to the corpus text comprises:
acquiring the corpus text;
and segmenting the corpus text by using an N-gram model to obtain the candidate words corresponding to the corpus text.
6. The method of claim 3, wherein the characteristic weight parameter is a BM25 value or a TF-IDF value of the candidate word.
7. The method according to any one of claims 1 to 6, further comprising:
and adjusting the value range of the word characteristics to be within a preset value range.
8. A new word determining apparatus, the apparatus comprising:
the acquisition module is used for acquiring a corpus text, and segmenting the corpus text to obtain candidate words corresponding to the corpus text;
the first calculation module is used for determining a value score of the candidate word according to word characteristics and characteristic weight parameters of the candidate word;
the second calculation module is used for determining the similarity between the candidate word and a preset corpus;
and the determination module is used for determining whether the candidate word is a new word according to the value score and the similarity.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.
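By way of illustration only (the following sketch is not part of the claims), the similarity determination of claim 2, the TF-IDF weight of claim 6, and the value range adjustment of claim 7 might be realized as follows; the cosine similarity measure, the particular TF-IDF formula, and min-max scaling are assumptions, since the claims do not fix specific formulas.

    # Illustrative sketch of claims 2, 6 and 7; the formulas chosen here
    # (cosine similarity, a simple TF-IDF, min-max scaling) are assumptions.
    import math

    def cosine_similarity(v1, v2):
        dot = sum(a * b for a, b in zip(v1, v2))
        norm = (math.sqrt(sum(a * a for a in v1))
                * math.sqrt(sum(b * b for b in v2)))
        return dot / norm if norm else 0.0

    def corpus_similarity(candidate_vector, corpus_vectors):
        # Claim 2: similarity between the candidate word's vector and the
        # preset corpus, taken here as the best match over corpus vectors.
        return max((cosine_similarity(candidate_vector, v)
                    for v in corpus_vectors), default=0.0)

    def tf_idf(term_frequency, document_count, documents_with_term):
        # Claim 6: a TF-IDF value serving as the characteristic weight parameter.
        return term_frequency * math.log(
            (1 + document_count) / (1 + documents_with_term))

    def rescale(values, low=0.0, high=1.0):
        # Claim 7: adjust the value range of the word characteristics into a
        # preset value range via min-max scaling.
        v_min, v_max = min(values), max(values)
        if v_max == v_min:
            return [low for _ in values]
        return [low + (v - v_min) * (high - low) / (v_max - v_min)
                for v in values]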
CN202311200191.1A 2023-09-15 2023-09-15 New word determining method and device, terminal equipment and storage medium Pending CN117150317A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311200191.1A CN117150317A (en) 2023-09-15 2023-09-15 New word determining method and device, terminal equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311200191.1A CN117150317A (en) 2023-09-15 2023-09-15 New word determining method and device, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117150317A (en) 2023-12-01

Family

ID=88906064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311200191.1A Pending CN117150317A (en) 2023-09-15 2023-09-15 New word determining method and device, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117150317A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination