CN112559694A

CN112559694A - Method and device for discovering new words, computer storage medium and electronic equipment

Info

Publication number: CN112559694A
Application number: CN202110188681.9A
Authority: CN
Inventors: 吴成龙; 李德春
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-02-19
Filing date: 2021-02-19
Publication date: 2021-03-26
Anticipated expiration: 2041-02-19
Also published as: CN112559694B

Abstract

The application provides a method, a device, a computer storage medium and electronic equipment for discovering new words, which relate to natural language processing in the field of artificial intelligence, and the method comprises the steps of after candidate word strings are obtained, utilizing prefixes and suffixes of target word strings (the candidate word strings which are not recorded in a word stock) to carry out information entropy calculation to obtain front information entropy and rear information entropy; calculating an information entropy score by using the front information entropy and the rear information entropy; the information entropy score is inversely related to the association degree of the target word string and the associated context; combining the mutual information value and the information entropy value of the target character string to obtain a new word value of the target character string; and selecting a plurality of target word strings as new words according to the scores of the new words from big to small. According to the scheme, the target word strings without the independent semantics are screened by using the information entropy scores, and the accuracy of finding new words is improved.

Description

Method and device for discovering new words, computer storage medium and electronic equipment

Technical Field

The present invention relates to natural language processing technologies, and in particular, to a method and an apparatus for discovering new words, a computer storage medium, and an electronic device.

Background

With the rapid development of the internet, a large number of new words (referring to words not recorded in a word bank) appear on the network at intervals, and to enable applications such as information retrieval and machine translation to adapt to the new words in time and perform corresponding adjustment, the new words need to be rapidly and accurately found from a corpus set collected within a period of time.

In the conventional new word discovery method, generally, for a word string that is not recorded in a word bank, a mutual information value (frequency of common occurrences in a text that represents a plurality of words included in the word string) of the word string is calculated, and then the word string that has a higher mutual information value and is not recorded in the word bank is used as a new word. However, this method is easy to recognize the word string that cannot be formed into new words in some common phrases, such as the co-occurrence frequency of "degree" and "learning" in the common phrase "deep learning technique," degree "has a high mutual information value, but obviously, this word string cannot be formed into words, so the accuracy of the result of the existing discovery method is low.

Disclosure of Invention

Based on the above drawbacks of the prior art, the present application provides a method, an apparatus, a computer storage medium, and an electronic device for discovering new words, so as to provide a more accurate new word discovery scheme.

A first aspect of the present application provides a method for discovering new words, including:

obtaining a plurality of texts;

for each string length in a preset string length range, splitting the text into candidate strings with the length of the string length; wherein the candidate word string consists of consecutive words in the text;

performing information entropy calculation on the target string by using a prefix of the target string to obtain a front information entropy of the target string, and performing information entropy calculation on the target string by using a suffix of the target string to obtain a rear information entropy of the target string; wherein the target word string refers to a candidate word string which contains a plurality of continuous characters and is not recorded in a word stock; the prefix of the target word string refers to the last character, located before the target word string, in the text to which the target word string belongs; a suffix of the target string, which refers to a first word of the text to which the target string belongs, which is located after the target string;

calculating to obtain an information entropy score according to the front information entropy and the rear information entropy of the target string and the information entropy similarity between the front information entropy and the rear information entropy; wherein the information entropy score is inversely related to the association degree between the target word string and the context to which the target word string belongs; the context to which the target string belongs comprises a prefix and a suffix of the target string;

combining the mutual information value of the target word string with the information entropy value to obtain a new word value of the target word string;

selecting the first N target word strings as new words according to the descending order of the new word scores of the target word strings; wherein N is a positive integer.

Optionally, the performing information entropy calculation on the target word string by using the prefix of the target word string to obtain the pre-information entropy of the target word string includes:

counting the occurrence frequency of the prefix of the target string;

calculating the pre-information entropy of the target string according to the occurrence frequency of the prefix;

performing information entropy calculation on the target string by using the suffix of the target string to obtain the post-information entropy of the target string, including:

counting the occurrence frequency of suffixes of the target word strings;

and calculating the back information entropy of the target word string according to the occurrence frequency of the suffix.

Optionally, before selecting the first N target word strings as new words according to the descending order of the new word scores of the target word strings, the method further includes:

acquiring a suffix of a first character; wherein the first word refers to a first word of the target word string; a suffix of the first word that refers to a first word that follows the first word in text to which the first word belongs;

carrying out information entropy calculation on the suffix of the first character to obtain a first information entropy;

acquiring a prefix of a second character; wherein the second word refers to a last word of the target string; the prefix of the second word refers to the last word before the second word in the text to which the second word belongs;

performing information entropy calculation on the prefix of the second character to obtain a second information entropy;

subtracting the minimum value of the first information entropy and the second information entropy from the new word score of the target word string to obtain a corrected new word score of the target word string;

the selecting the first N target word strings as new words according to the descending order of the new word scores of the target word strings comprises the following steps:

and selecting the first N target word strings as new words according to the descending order of the scores of the new words after the target word strings are corrected.

obtaining the occurrence times of the target word string;

adjusting the new word score of the target word string according to the occurrence frequency of the target word string;

and selecting the first N target word strings as new words according to the descending order of the scores of the new words of the target word strings.

Optionally, the method further includes:

outputting screening prompt information; wherein the screening prompt message comprises the new word; and the screening prompt information is used for prompting the user to screen the new words.

A second aspect of the present application provides an apparatus for discovering new words, including:

an obtaining unit configured to obtain a plurality of texts;

the word segmentation unit is used for segmenting the text into candidate word strings with the length being the length of each word string in a preset word string length range; wherein the candidate word string consists of consecutive words in the text;

the first calculation unit is used for performing information entropy calculation on the target string by using a prefix of the target string to obtain a front information entropy of the target string, and performing information entropy calculation on the target string by using a suffix of the target string to obtain a rear information entropy of the target string; wherein the target word string refers to a candidate word string which contains a plurality of continuous characters and is not recorded in a word stock; the prefix of the target word string refers to the last character, located before the target word string, in the text to which the target word string belongs; a suffix of the target string, which refers to a first word of the text to which the target string belongs, which is located after the target string;

the second calculating unit is used for calculating to obtain an information entropy score according to the front information entropy and the rear information entropy of the target string and the information entropy similarity between the front information entropy and the rear information entropy; wherein the information entropy score is inversely related to the association degree between the target word string and the context to which the target word string belongs; the context to which the target string belongs comprises a prefix and a suffix of the target string;

the merging unit is used for merging the mutual information value of the target character string and the information entropy value to obtain a new word value of the target character string;

the selecting unit is used for selecting the first N target word strings as new words according to the descending order of the new word scores of the target word strings; wherein N is a positive integer.

Optionally, when the first computing unit performs information entropy computation on the target word string by using the prefix of the target word string to obtain the pre-information entropy of the target word string, the first computing unit is specifically configured to:

counting the occurrence frequency of the prefix of the target string;

the first calculating unit is specifically configured to, when performing information entropy calculation on the target string by using the suffix of the target string to obtain a post-information entropy of the target string:

counting the occurrence frequency of suffixes of the target word strings;

Optionally, the apparatus further includes a correction unit, configured to:

the selecting unit is specifically configured to, when selecting the first N target word strings as new words according to the descending order of the new word scores of the target word strings, specifically:

A third aspect of the present application provides a computer storage medium for storing a computer program, which, when executed, is particularly adapted to implement the method for new word discovery provided in any one of the first aspects of the present application.

A fourth aspect of the present application provides an electronic device comprising a memory and a processor;

wherein the memory is for storing a computer program;

the processor is configured to execute the computer program, and in particular, to implement the method for discovering new words provided by any one of the first aspects of the present application.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a flowchart of a method for discovering a new word according to an embodiment of the present application;

fig. 2 is a flowchart of another method for discovering new words according to an embodiment of the present application;

fig. 3 is a schematic structural diagram of an apparatus for discovering a new word according to an embodiment of the present application;

fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. That is to say, artificial intelligence is a comprehensive technology of computer science, and the design principle and implementation method of various intelligent machines are specifically studied, so that the machines have the functions of perception, reasoning and decision making.

As a comprehensive subject, the artificial intelligence technology relates to various technical fields including a hardware layer and a software layer, wherein the artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like. The application relates to the natural language processing technology in the technical field of artificial intelligence software.

Natural Language Processing (NLP) has been mainly studied on various theories and methods for realizing efficient communication between a person and a computer using natural Language. Natural language processing is a science integrating linguistics, computer science and mathematics. Therefore, the research in this field will involve natural language, i.e. the language that people use everyday, so it is closely related to the research of linguistics. The present application relates to text processing and semantic understanding branches of natural language processing technologies, and in particular, to a method for mining new words from natural language texts in a period of time, that is, a method for discovering new words.

It should be noted that the method for discovering new words provided by the present application can also be applied to a blockchain system. Specifically, the natural language text provided by the user in the blockchain system over a period of time (e.g., the last month) may be collected to obtain a text set, and the new word discovery method provided by the present application may be applied to the text set to find out new words appearing in the blockchain system over the period of time.

The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism and an encryption algorithm. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.

The block chain underlying platform can comprise functional modules such as user management, basic service, intelligent contract and operation monitoring. The user management module is responsible for identity information management of all blockchain participants, and comprises public and private key generation maintenance (account management), key management, user real identity and blockchain address corresponding relation maintenance (authority management) and the like, and under the authorization condition, the user management module supervises and audits the transaction condition of certain real identities and provides rule configuration (wind control audit) of risk control; the basic service module is deployed on all block chain node equipment and used for verifying the validity of the service request, recording the service request to storage after consensus on the valid request is completed, for a new service request, the basic service firstly performs interface adaptation analysis and authentication processing (interface adaptation), then encrypts service information (consensus management) through a consensus algorithm, transmits the service information to a shared account (network communication) completely and consistently after encryption, and performs recording and storage; the intelligent contract module is responsible for registering and issuing contracts, triggering the contracts and executing the contracts, developers can define contract logics through a certain programming language, issue the contract logics to a block chain (contract registration), call keys or other event triggering and executing according to the logics of contract clauses, complete the contract logics and simultaneously provide the function of upgrading and canceling the contracts; the operation monitoring module is mainly responsible for deployment, configuration modification, contract setting, cloud adaptation in the product release process and visual output of real-time states in product operation, such as: alarm, monitoring network conditions, monitoring node equipment health status, and the like.

The platform product service layer provides basic capability and an implementation framework of typical application, and developers can complete block chain implementation of business logic based on the basic capability and the characteristics of the superposed business. The application service layer provides the application service based on the block chain scheme for the business participants to use.

In the working process of each function module, a certain amount of natural language text may be obtained, for example, in order to provide easy-to-understand notes for each item of data in the blockchain system, when the user management module implements the wind control audit function, a natural language text describing rules of risk control may be recorded, the service information obtained by the basic service module may include a natural language text describing a service, each intelligent contract managed by the intelligent contract module, and a natural language text describing specific contents of the intelligent contract, and the natural language texts may all be input by a user when the user uses the blockchain system.

Because various services often appear in the block chain, new words for describing certain novel services may appear in the corresponding natural language texts, and the method for finding the new words provided by the application can collect the natural language texts which are input by the user in the block chain system within the latest period of time, and dig out the new words appearing within the latest period of time from the natural language texts so as to carry out statistics on the service types recently processed by the block chain system in the following period of time.

Referring to fig. 1, a method for discovering a new word provided by an embodiment of the present application may include the following steps:

s101, obtaining a plurality of texts.

Generally, the method for discovering new words provided by the embodiment of the present application may be executed once every update period of a thesaurus, and the update period of the thesaurus may be set according to actual situations, for example, set to be updated once every two weeks, or once every one week. Therefore, the text obtained in step S101 may include the sentence uploaded by the user in the latest update cycle.

For example, in a social networking platform, a user may complain about other users who have a suspected fraudulent violation, and when the complaint is reported, the user who initiated the complaint may explain the complaint reason and give feedback in text form. Accordingly, the texts obtained in step S101 may be the complaint texts received in the latest update cycle. By identifying the new words in the complaint texts, important bases can be provided for subsequent semantic analysis of the complaint texts and anti-fraud identification based on semantic analysis results, for example, whether a complaint is a dispute complaint or a malicious complaint can be judged according to the latest suspicious user fraud type.

Examples of some complaint texts are as follows:

text 1: "the pornographic picture is broadcast in the circle of friends, and the pornographic picture is also given to us".

Text 2: "his circle of friends has many purchase links to pirated goods".

Text 3: "he forwards fraud information to many friends".

S102, performing word segmentation processing on the text to obtain a plurality of candidate word strings contained in the text.

Wherein, the candidate word string comprises continuous M characters; m is an integer greater than 1 and less than or equal to a preset generation length.

Alternatively, M may be each integer satisfying a condition that is greater than or equal to 1 and less than or equal to a preset string length, for example, if the string length is set to 3, M may be equal to 2 and 3, and if the string length is set to 2, M is equal to 2.

When M is equal to 2, step S102 is to split each text obtained in step S101 into a plurality of candidate strings formed by combining two consecutive characters, specifically, each two consecutive characters in the text of step S101 are regarded as a candidate string.

When M is equal to 2 and 3, step S102 is to split each text obtained in step S101 into a candidate string composed of two consecutive words and into a candidate string composed of 3 consecutive words.

The continuous words, which refer to two adjacent words without other characters in the middle of a sentence, in combination with the example of step S101, the candidate word string obtained by splitting the text 1 and containing the word number 2 may include: { friends, circles, spreads, color, pornography, … … }.

Splitting the candidate string containing word number 2 where text 2 is split may include: { his, friends, circles, very many, … … }.

Splitting the candidate string containing word number 2 for text 3 may include: { He went, many, friends, friend turn, … … }.

The candidate word strings obtained by word segmentation can be represented by wi, where i represents that the candidate word string is the candidate word string obtained by the i-th recognized word segmentation in all the obtained texts.

Alternatively, a range of string lengths, such as 1 to 3, may be preset, and then the text is split into candidate strings having a length equal to the length of each string in the range.

For example, if the length of the character string ranges from 1 to 3, in step S102, each obtained text may be first segmented according to the length 1 to obtain a plurality of candidate character strings containing only one character, then segmented according to the length 2 to obtain a plurality of candidate character strings containing two consecutive characters, and then segmented according to the length 3 to obtain a plurality of candidate character strings containing three consecutive characters. And (4) the candidate word strings with all the lengths obtained by word segmentation are used for subsequent calculation.

S103, performing information entropy calculation on the target character string by respectively using the prefix and the suffix of the target character string to obtain the front information entropy and the rear information entropy of the target character string.

The target word string refers to a candidate word string which is not recorded in the word stock in a plurality of candidate word strings; the prefix of the target word string refers to the last character which is positioned before the target word string in the text to which the target word string belongs; and the suffix of the target string refers to the first word after the target string in the text to which the target string belongs.

In the candidate strings obtained by the word segmentation in step S102, there may be a plurality of target strings. Accordingly, in step S103, for each target string, the pre-information entropy and the post-information entropy of the target string are calculated.

For a specific target string, information entropy calculation may be performed on a prefix of the target string to obtain a pre-information entropy of the target string, and information entropy calculation may be performed on a suffix of the target string to obtain a post-information entropy of the target string.

After obtaining the candidate strings in step S102, each of the appearing candidate strings may be compared with the recorded vocabulary in the lexicon, and if there is no completely identical vocabulary in the lexicon for any candidate string, the candidate string is determined as a target string, for example, if the candidate string of "friend" in the foregoing example does not find completely identical vocabulary in the lexicon, the candidate string of "friend" is identified as the target string.

The information Entropy (Entropy) is used for measuring the size of the information quantity of a certain variable, and if the uncertainty of the certain variable is larger, the larger the information quantity is, the larger the Entropy is; if the uncertainty is smaller, the amount of information is smaller and the entropy is smaller. In the embodiment of the present application, for a specific target string, the prefix and the suffix of the string both correspond to a variable, and in different texts, the prefix and the suffix of the string may be specific to different characters, and the specific characters in the text correspond to specific values of the prefix and the suffix.

For a variable, the larger the value range of the variable (i.e., the more variable values that the variable can take), the closer the probability of reaching each variable value, the larger the information entropy of the variable, and vice versa.

For a target string (denoted as wi), the pre-information entropy can be calculated by:

first, the frequency of occurrence of the prefix of the target string is counted.

For a target word string wi, each word immediately preceding the target word string (i.e. the last word before the target word string) is a prefix of the target word string, and the frequency of occurrence Pf (wi, k) of a certain prefix k of the target word string is equal to:

where Nf (wi, k) represents the number of times that the word with prefix k is the prefix of the target word string in all the texts obtained previously, and Nf (wi) represents the number of times that the target word string wi has the prefix in all the texts obtained previously.

In connection with the example of step S101, the "friend" string has no prefix in text 1, the "friend" string has a prefix "in text 2, and the" friend "string has a prefix" many "in text 3.

It can be seen that, in the 3 texts in the example of step S101, the number of times that the "friend" string (denoted by wi) has a prefix is 2, that is, nf (wi) is equal to 2, the prefixes of the "friend" string have two prefixes, namely, "friend" (denoted as prefix 1) and "friend" (denoted as prefix 2), and the two prefixes of the "friend" string have 2 times, so that Pf (wi, 1) is equal to 0.5, and Pf (wi, 2) is equal to 0.5.

Further, assume that text 4 "his circle of friends released many rumors" is also obtained in step S101. Then the "friend" string (denoted by wi) has a prefix number of 3 in all the texts, where "occurrence number of prefix 1" is 2 and "multiple" occurrence number of prefix 2 is 1, so that Pf (wi, 1) is equal to 2/3 and Pf (wi, 2) is equal to 1/3.

After the occurrence frequency of the prefix is obtained, the pre-information entropy of the target string can be calculated according to the occurrence frequency of the prefix by using the following information entropy formula.

The information entropy formula is as follows:

where K represents the number of categories of prefixes that the target string wi has, and "many" of "friend" strings having two types of prefixes "in all the texts acquired in step S101, so K is equal to 2. Hf (wi) represents the pre-information entropy of the target string wi.

The calculation method of the post information entropy is similar to that of the pre information entropy, and still taking the aforementioned 3 texts and the "friend" string as an example, the occurrence frequency of the suffix of the "friend" string can be counted first.

Specifically, in the aforementioned 3 texts, the "friend" string (denoted by wi) has a suffix of 3 times, which includes two suffixes, i.e., "circle" (denoted by suffix 1) and "turn" (denoted by suffix 2), wherein suffix 1 occurs 2 times and suffix 2 occurs 1 times, so that the frequency of occurrence Pa (wi, 1) of suffix 1 is equal to 2/3 and the frequency of occurrence Pa (wi, 2) of suffix 2 is equal to 1/3.

Substituting the occurrence frequency of the suffix of the target word string into an information entropy formula, and calculating the postinformation entropy of the target word string according to the occurrence frequency of the suffix:

where ha (wi) represents the post-information entropy of the target string wi.

S104, calculating to obtain an information entropy score according to the front information entropy and the rear information entropy of the target string and the information entropy similarity between the front information entropy and the rear information entropy.

The information entropy score is inversely related to the association degree between the target character string and the context to which the target character string belongs; the context to which the target string belongs includes a prefix and a suffix of the target string.

The entropy score of the target string can be denoted as hs (wi), and can be specifically calculated by the following formula:

sh (wi) represents information entropy similarity of the target string wi, that is, similarity between pre-information entropy and post-information entropy representing the target string. Alternatively, sh (wi) may be calculated using the following equation:

that is, the information entropy similarity of a target string may be equal to the minimum value of the pre-information entropy and the post-information entropy of the target string, divided by the maximum value of the pre-information entropy and the post-information entropy of the target string.

As can be seen from the above formula for calculating the entropy value, the larger the values of the pre-entropy and the post-entropy of a target string are, and the closer the two values are, the higher the entropy value of the target string is.

Further, the larger the similarity between the front information entropy, the rear information entropy and the information entropy of a target string is, the target string can be used with a plurality of different characters, that is, the target string can be applied to a plurality of different contexts, and the degree of association between the target string and each context to which the target string belongs is low, that is, the target string can be used independently without any context.

The higher the probability that a target string satisfying the above condition has independent and complete semantics, in other words, the more likely the target string is a vocabulary rather than a meaningless combination of partial characters in a phrase.

It should be noted that the formula for calculating the information entropy score is not limited to the above formula, and other formulas may also be used to calculate the information entropy score, and it is only necessary to ensure that the size of the information entropy score is positively correlated with the pre-information entropy of the target string, and that the post-information entropy is positively correlated with the similarity of the information entropy, for example, the information entropy score may also be calculated by the following formula:

s105, combining the mutual information value and the information entropy value of the target character string to obtain a new word value of the target character string.

It is understood that the process of calculating the new word score of the target word string from step S102 to step S105 may be performed on each target word string obtained by the word segmentation in step S102, so as to obtain a new word score corresponding to each target word string.

As described in the background, the Mutual information value of the target string, which is used to represent the frequency of the common occurrence of multiple words contained in the string in the text, specifically, for any target string wi, the Mutual information value pmi (position Mutual information) (wi) can be calculated by the following formula:

wherein p (wi) represents the frequency of occurrence of all the strings obtained by segmenting the text in step S102.

For example, assuming that the word segmentation of all the obtained texts in step S102 results in 30 word strings, which include 3 "friend" word strings, i.e. the "friend" word string appears 3 times, the frequency of occurrence of the "friend" word string is equal to 1/10.

M is the number of words included in the target string, j represents the jth word in the target string, and p (j) represents the frequency of occurrence of the jth word in the target string in all the obtained texts, for example, the text obtained in step S101 includes 50 words, where the jth word in the target string appears 6 times, and the frequency of occurrence p (j) of the jth word in the target string is 0.12.

The formula for calculating the mutual information value means that the occurrence frequency of the target string is divided by the product of the occurrence frequency of each character of the target string, and then the obtained ratio is logarithmized, and the final result is the mutual information value of the target string wi.

In the merging step S105, the entropy score of the target string and the mutual information value may be directly added to obtain a new word score.

Therefore, for a target word string wi, the new word score Si is:

s106, selecting the first N target character strings as new words according to the descending order of the new word scores of the target character strings.

Wherein N is a positive integer.

For example, if N is set equal to 10, in step S106, the top 10 target word strings with the largest new word score may be selected as new words to be added to the word bank.

Optionally, in order to avoid erroneously recognizing some word strings that cannot form words as new words, screening prompt information may be output to terminal devices of the user and other technicians, where the screening prompt information carries the new words recognized in step S106, and after the screening prompt information is displayed on the terminal device, the screening prompt information may prompt the corresponding personnel to screen the new words recognized in step S106, delete the new words that cannot form words, and finally add the remaining new words after screening in the word bank.

The method for discovering the new words provided by the embodiment of the application has the following beneficial effects:

in the first aspect, for any word string, if the word string can independently form a vocabulary, the word string should be usable with multiple words under multiple contexts, that is, the word string should have a low degree of association with its associated context, so that the word string can be expressed semantically independent of its associated context, and therefore, in a large amount of text, the word string should have multiple prefixes and suffixes, wherein the frequency of occurrence of each prefix is relatively close, and the frequency of occurrence of each suffix is relatively close.

Based on this, when a new word is found, the method provided in this embodiment comprehensively considers the mutual information value and the information entropy value of the target word string, and only identifies a new word under the condition that a target word string has a higher occurrence frequency (i.e., has a larger mutual information value) and the front information entropy and the rear information entropy are higher (i.e., has a larger information entropy value), thereby avoiding identifying a part of word strings, which do not form words, in some word groups with higher occurrence frequencies as new words, and improving the accuracy of the new word finding result.

In a second aspect, as can be seen from the foregoing calculation formulas of the information entropy scores and the new word scores, in this embodiment, the new word score is also related to the information entropy similarity of the target word string, and the higher the information entropy similarity of the target word string is, the more likely it is to be recognized as a new word, whereas if the information entropy similarity is lower, the target word string is not recognized as a new word.

In this way, the first half or the second half of a part of common expressions can be prevented from being recognized as a new word, for example, if a large amount of common expressions such as "public numbers" appear in the obtained text, the word string "public numbers" may have a large back information entropy, which results in a large sum of the front information entropy and the back information entropy in the information entropy scores, but for the "public number" word string, the prefix thereof only contains "public", the front information entropy is far smaller than the back information entropy, so that the information entropy similarity is small, and the finally obtained information entropy scores and the new word scores of the "public number" word string are also small, thereby preventing the "public numbers" from being recognized as a new word.

Referring to fig. 2, another embodiment of the present application further provides a method for discovering new words, which may include the following steps:

s201, obtaining a plurality of texts.

S202, performing word segmentation processing on the text to obtain a plurality of candidate word strings contained in the text.

S203, respectively utilizing the prefix and the suffix of the target string to carry out information entropy calculation to obtain the front information entropy and the rear information entropy of the target string.

S204, calculating according to the front information entropy and the rear information entropy of the target string and the information entropy similarity between the front information entropy and the rear information entropy to obtain an information entropy score.

S205, combining the mutual information value and the information entropy value of the target character string to obtain a new word value of the target character string.

The specific implementation of steps S201 to S205 is the same as that of steps S101 to S105 in the foregoing embodiment, and will not be described in detail here.

S206, obtaining the occurrence frequency of the target character string, and adjusting the new word score of the target character string downwards according to the occurrence frequency of the target character string to obtain the adjusted new word score.

For any target word string wi, an optional method for adjusting the new word score downward may be:

wherein st (wi) represents a new word score after the target word string wi is down-regulated, K is a down-regulation coefficient, is a preset integer, and can be regulated according to a scene, specifically, can be set to a numerical value of 10, 20, 30, etc., tanh represents a hyperbolic tangent function, the function has properties of single increment and single increment of an inverse function, and is widely applied to a neural network activation function, and an expression of tanh [ n (wi)/K ] is as follows:

n (wi) represents the number of times the target word string wi appears in all the acquired texts, and the hyperbolic tangent function has the characteristic that the calculated function value is always smaller than 1 no matter how large the input variable, i.e., n (wi)/K, and the output function value of the hyperbolic tangent function monotonically increases as n (wi)/K increases, specifically as n (wi) increases.

That is, in step S206, the more frequently appearing target strings have new word scores that are relatively smaller in downward adjustment, and conversely, the less frequently appearing target strings have new word scores that are relatively larger in downward adjustment.

S207, selecting the first N target character strings as new words according to the descending order of the scores of the new words of the target character strings.

The effect of down-regulating the new word score according to the occurrence frequency of the target word string is as follows:

in an actual application scene, a plurality of new words can emerge in each updating period, however, some new words are only used in a few specific scenes, the use frequency is low, and the part of new words are likely not to act in the subsequent semantic analysis and mining process based on the word stock due to the low use frequency, so that the storage space of the word stock can be wasted by recording the part of new words in the word stock, and the score of the part of new words meeting the conditions is reduced by using the occurrence frequency of the target word string, so that the situation that the words with too low use frequency are added into the word stock is avoided, and the utilization rate of the storage space of the word stock is improved.

Optionally: prior to executing step S206, the new word score may also be modified by the following method

A suffix of the first letter is obtained.

Wherein, the first character refers to the first character of the target character string; the suffix of a first word refers to the first word that follows the first word in the text to which the first word belongs.

And carrying out information entropy calculation on the suffix of the first character to obtain a first information entropy.

And acquiring the prefix of the second character.

Wherein, the second character refers to the last character of the target string; the prefix of the second word refers to the last word before the second word in the text to which the second word belongs.

And performing information entropy calculation on the prefix of the second character to obtain a second information entropy.

Taking a "friend" word string as an example, a "friend" is a first word of the "friend" word string, the first information entropy is substantially obtained by performing information entropy calculation on suffixes of "friend" words in all acquired texts, the "friend" is a second word of the "friend" word string, and the second information entropy is substantially obtained by performing information entropy calculation on prefixes of "friend" words in all acquired texts.

The specific processes of performing information entropy calculation on the suffix of the first character and performing information entropy calculation on the prefix of the second character can refer to the calculation processes of the pre-information entropy and the post-information entropy in the foregoing embodiments, and are not described in detail here.

And subtracting the minimum value of the first information entropy and the second information entropy from the new word score of the target word string to obtain the corrected new word score of the target word string.

Specifically, for the target word string wi, the first information entropy is denoted as hl (wi), the second information entropy is denoted as hr (wi), and the modified new word score may be denoted as sd (wi), where:

it should be noted that the correction method and the down-regulation process in step S205 may be executed separately or may be executed in combination in one embodiment.

Specifically, if only the above correction method is performed, after the corrected new word score of each target word string is obtained, the first N target word strings may be selected as new words according to the descending order of the corrected new word scores.

If the correction method and the down-adjustment process in step S206 are combined and executed in one embodiment, the correction method may be executed first to obtain the corrected new word score of each target word string, then the corrected new word score is further down-adjusted according to the down-adjustment process in step S206, and finally, N target word strings are selected as new words according to the down-adjusted new word scores from large to small.

In step S206, according to different set down-adjustment coefficients, the score st (wi) of the new word of the target word string wi after being down-adjusted changes accordingly, and the new word obtained by final screening may also have a certain difference, and several examples of the new word obtained by mining according to the method of the present application under different down-adjustment coefficients are given below.

Table 1 below shows that, for a given group of text mined new words when the down-regulation coefficient K in step S206 is 50, the set number N of the new words is 20, that is, the first 20 target word strings are selected as the new words in the descending order of the score value st (wi) of the new words:

TABLE 1

ID	wi	St（wi）	ID	wi	St（wi）
						1	Female girl	11.2612	11	Lip glaze	8.2718
2	Music score	10.9971	12	Acne removing device	8.2294
						3	Luoli	10.2106	13	Field of strange disease	8.1115
4	Tremble sound	10.2042	14	Common perilla herb	8.0925
						5	Attack in the wrong direction	10.1889	15	Fool force type	7.8839
6	Chinese saw blades	9.9545	16	Android device	7.8119
						7	Penis et testis Cervi	9.9442	17	Adopt	7.7241
8	Cake made of glutinous rice flour	9.6200	18	Mine pond	7.6300
						9	Crowd funding	8.9583	19	Beautiful pupil	7.5608
10	Summer-heat autumn	8.8692	20	Racing lottery	7.5059

In table 1, wi represents the target word string, st (wi) represents the new word score of the left target word string obtained after being down-adjusted in step S206, and ID represents the sequence of the corresponding target word strings sorted from large to small according to the new word score.

Table 2 below shows the first 20 target word strings with larger scores, i.e. the new words obtained by mining the same batch of text (i.e. the text used for mining is still the text used in table 1), when the key-down coefficient K is set to 100:

TABLE 2

ID	wi	St（wi）	ID	wi	St（wi）
						1	Female girl	11.2584	11	Android device	7.8119
2	Music score	10.9971	12	Field of strange disease	7.7836
						3	Tremble sound	10.2042	13	Mine pond	7.2707
4	Penis et testis Cervi	9.8636	14	Secret bean	7.2608
						5	Luoli	9.7016	15	Imporopels	7.2574
6	Attack in the wrong direction	9.6410	16	Racing lottery	7.1907
						7	Chinese saw blades	9.6290	17	Shunfeng	7.1491
8	Crowd funding	8.9583	18	Male with dregs	7.1241
						9	Summer-heat autumn	8.4896	19	Brain disability	7.1103
10	Fool force type	7.8839	20	Home textile	6.9810

The meaning of each column of data in table 2 is the same as table 1. Comparing table 1 and table 2, it can be seen that when the key-down factor K is set to 50, the strings of "home textile" with ID 20 and "shunfeng" with ID 17 in table 2 are not included in the new words obtained by mining, i.e. these strings do not appear in table 1, and when K is set to 100, the two strings are recognized as new words.

Therefore, in the embodiment shown in fig. 2, the same batch of texts can be repeatedly mined by adjusting the value of the down-regulation coefficient K, so as to obtain more new words.

For example, when the down-regulation coefficient K is set to 150, 20 new words as shown in table 3 can be mined from the same batch of text:

TABLE 3

ID	wi	St（wi）	ID	wi	St（wi）
						1	Female girl	11.2578	11	Android device	7.8119
2	Music score	10.9971	12	Field of strange disease	6.9679
						3	Tremble sound	10.2042	13	Mine pond	6.4441
4	Penis et testis Cervi	9.4488	14	Secret bean	7.2199
						5	Luoli	8.5650	15	Imporopels	6.6003
6	Attack in the wrong direction	8.4658	16	Racing lottery	6.4215
						7	Chinese saw blades	8.7291	17	Shunfeng	7.1474
8	Crowd funding	8.9580	18	Male with dregs	6.8824
						9	Summer-heat autumn	7.5721	19	Brain disability	6.7103
10	Fool force type	7.8839	20	Home textile	6.0699

When the down-regulation coefficient K is set to 200, 20 new words shown in table 4 can be obtained by mining the same batch of text:

TABLE 4

ID	wi	St（wi）	ID	wi	St（wi）
						1	Female girl	11.2509	11	Shunfeng	7.1323
2	Music score	10.9970	12	Secret bean	7.0799
						3	Tremble sound	10.2042	13	WeChat	6.9281
4	Crowd funding	8.9536	14	Sweep sign indicating number	6.8346
						5	Penis et testis Cervi	8.7495	15	Mi le	6.7629
6	Fool force type	7.8839	16	Anchor (R)	6.6127
						7	Android device	7.8117	17	Summer-heat autumn	6.5679
8	Chinese saw blades	7.6680	18	Descending right	6.4423
						9	Luoli	7.3734	19	Male with dregs	6.4385
10	Attack in the wrong direction	7.2600	20	Zero strip	6.3399

When the down-regulation coefficient K is set to 400, 20 new words shown in table 5 can be obtained by mining the same batch of text:

TABLE 5

ID	wi	St（wi）	ID	wi	St（wi）
						1	Music score	10.9441	11	Anchor (R)	6.3455
2	Female girl	10.8554	12	Ore machine	6.1522
						3	Tremble sound	10.2042	13	Flower	5.9325
4	Crowd funding	8.6705	14	Penis et testis Cervi	5.9307
						5	Fool force type	7.8839	15	Descending right	5.9276
6	Android device	7.7595	16	Secret bean	5.7878
						7	WeChat	6.9281	17	Bag post	5.5050
8	Sweep sign indicating number	6.8346	18	Appointment big gun	5.4990
						9	Mi le	6.7585	19	Word language	5.4074
10	Shunfeng	6.6738	20	Give you	5.2468

The data in tables 3 to 5 have the same meanings as in table 1.

In combination with the method for discovering new words provided by the embodiment of the present application, an embodiment of the present application further provides a device for discovering new words, please refer to fig. 3, where the device may include the following units:

an obtaining unit 301 for obtaining a plurality of texts.

The word segmentation unit 302 is configured to segment the text into candidate word strings with lengths equal to the length of each word string in a preset word string length range.

Wherein, the candidate word string is composed of continuous characters in the text.

The first calculating unit 303 is configured to perform information entropy calculation on the target string by using the prefix of the target string to obtain a pre-information entropy of the target string, and perform information entropy calculation on the target string by using the suffix of the target string to obtain a post-information entropy of the target string.

The target word string refers to a candidate word string which contains a plurality of continuous characters and is not recorded in a word stock; the prefix of the target word string refers to the last character which is positioned before the target word string in the text to which the target word string belongs; and the suffix of the target string refers to the first word after the target string in the text to which the target string belongs.

The second calculating unit 304 is configured to calculate an information entropy score according to the front information entropy and the rear information entropy of the target string and an information entropy similarity between the front information entropy and the rear information entropy.

The merging unit 305 is configured to merge the mutual information value and the information entropy value of the target word string to obtain a new word value of the target word string.

The selecting unit 306 is configured to select the first N target word strings as new words according to the descending order of the new word scores of the target word strings.

Wherein N is a positive integer.

Optionally, when the first calculating unit 303 performs information entropy calculation on the target string by using the prefix of the target string to obtain the pre-information entropy of the target string, the first calculating unit is specifically configured to:

counting the occurrence frequency of the prefix of the target string;

the first calculating unit 303 is specifically configured to, when performing information entropy calculation on the target string by using the suffix of the target string to obtain the post-information entropy of the target string:

counting the occurrence frequency of suffixes of the target word strings;

and calculating the postinformation entropy of the target word string according to the occurrence frequency of the postfix.

Optionally, the apparatus further includes a correction unit 307, configured to:

when the selecting unit 306 selects the first N target word strings as new words according to the descending order of the new word scores of the target word strings, the selecting unit is specifically configured to:

and selecting the first N target character strings as new words according to the descending order of the scores of the new words after the target character strings are corrected.

Optionally, the modification unit 307 may be further configured to:

acquiring the occurrence times of the target character string;

and selecting the first N target character strings as new words according to the descending order of the scores of the new words of the target character strings.

Optionally, the apparatus may further include:

and an output unit 308 for outputting the screening prompt information.

And the screening prompt information comprises new words and is used for prompting the user to screen the new words.

The specific working principle of the device for discovering new words provided by the embodiment of the present application may refer to the relevant steps in the method for discovering new words provided by the embodiment of the present application, and details are not described here.

The application provides a device for discovering new words, which relates to natural language processing in the field of artificial intelligence, wherein a word segmentation unit 302 segments a text into word strings, a first calculation unit 303 performs information entropy calculation on prefixes and suffixes of target word strings (which refer to the word strings which are not recorded in a word stock) to obtain front information entropy and rear information entropy, and the prefixes refer to last words located before the target word strings in the text; the suffix refers to the first word in the text that follows the target string; the second calculating unit 304 calculates an information entropy score according to the pre-information entropy and the post-information entropy, and the merging unit 305 merges the mutual information value and the information entropy score of the target string to obtain a new word score of the target string; the selecting unit 306 selects a plurality of target word strings as new words according to the scores of the new words from large to small. The prefixes and the suffixes of the strings in the common phrases are fixed, so that the front information entropy and the rear information entropy are small, the strings in the common phrases are screened out by using the front information entropy and the rear information entropy, the strings are prevented from being recognized as new words, and the accuracy of finding results is improved.

The embodiments of the present application further provide a computer storage medium for storing a computer program, where the computer program is specifically configured to implement the method for discovering new words provided in any embodiment of the present application when executed.

Referring to fig. 4, the electronic device includes a memory 401 and a processor 402.

The memory 401 is used for storing a computer program, among other things.

The processor 402 is configured to execute the computer program, and in particular, is configured to implement the method for discovering new words provided by any of the embodiments of the present application.

According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method for new word discovery provided in the various alternative implementations of any of the aspects.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

It should be noted that the terms "first", "second", and the like in the present invention are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.

Those skilled in the art can make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for new word discovery, comprising:

obtaining a plurality of texts;

2. The method according to claim 1, wherein performing entropy calculation on the target string using the prefix of the target string to obtain pre-entropy of the target string, includes:

counting the occurrence frequency of the prefix of the target string;

counting the occurrence frequency of suffixes of the target word strings;

3. The method according to claim 1, wherein before selecting the first N target word strings as new words according to the descending order of the new word scores of the target word strings, further comprising:

4. The method according to claim 1, wherein before selecting the first N target word strings as new words according to the descending order of the new word scores of the target word strings, further comprising:

obtaining the occurrence times of the target word string;

5. The method of any one of claims 1 to 4, further comprising:

6. An apparatus for new word discovery, comprising:

an obtaining unit configured to obtain a plurality of texts;

7. The apparatus according to claim 6, wherein the first computing unit, when performing entropy computation on the target string using a prefix of the target string to obtain a pre-entropy of the target string, is specifically configured to:

counting the occurrence frequency of the prefix of the target string;

counting the occurrence frequency of suffixes of the target word strings;

8. The apparatus according to claim 6, further comprising a correction unit for:

9. A computer storage medium for storing a computer program which, when executed, is particularly adapted to implement a method of new word discovery as claimed in any one of claims 1 to 5.

10. An electronic device comprising a memory and a processor;

wherein the memory is for storing a computer program;

the processor is configured to execute the computer program, in particular to implement the method of new word discovery according to any of claims 1 to 5.