CN111680146A - Method and device for determining new words, electronic equipment and readable storage medium


Info

Publication number
CN111680146A
Authority
CN
China
Prior art keywords
word
sample
text
text set
frequent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010525541.1A
Other languages
Chinese (zh)
Inventor
刘志煌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010525541.1A priority Critical patent/CN111680146A/en
Publication of CN111680146A publication Critical patent/CN111680146A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application provides a method and a device for determining new words, an electronic device, and a readable storage medium. The method comprises the following steps: acquiring a sample text set; performing word sequence mining on the sample text set to obtain frequent word sequences of various lengths; determining each supersequence in the frequent word sequences of the various lengths; and, for each supersequence, if the supersequence is not contained in the participles contained in the sample text set, determining the supersequence to be a new word. In the embodiment of the application, word sequence mining can better screen out newly emerging characters, words, or phrases, which has important reference value and practical significance for applications such as word segmentation and new word discovery; in addition, in the process of determining the new words, no complex neural network model needs to be trained and no training samples need to be labeled manually, so the training cost is effectively reduced.

Description

Method and device for determining new words, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for determining a new word, an electronic device, and a readable storage medium.
Background
With the development of language and the constantly changing derivation of Internet words, new words and new professional terms emerge endlessly. For many natural language processing tasks, the quality of word segmentation plays a crucial role in the accuracy of the subsequent task flow, and is also an important factor in the effect of other basic tools (such as syntactic analysis, keyword extraction, and the like). In practice, new network words, names of people and places, professional terms and the like are frequently split by mistake by existing word segmentation tools, that is, the word segmentation result is inaccurate, and the root cause is incorrect recognition of new words. Therefore, how to mine new words is an important problem to be solved.
Disclosure of Invention
The present application aims to solve at least one of the above technical drawbacks.
In one aspect, an embodiment of the present application provides a method for determining a new word, where the method includes:
acquiring a sample text set;
carrying out word sequence mining on the sample text set to obtain frequent word sequences corresponding to all lengths;
determining each supersequence in the frequent word sequences corresponding to each length;
for each supersequence, if the supersequence is not contained in the participles contained in the sample text set, determining the supersequence to be a new word.
On the other hand, an embodiment of the present application provides a text processing method, including:
acquiring a text to be processed;
and performing word segmentation processing on the text to be processed based on the word segmentation database to obtain the words included in the text to be processed, wherein the word segmentation database comprises the new words determined by the method in the first aspect.
In another aspect, an embodiment of the present application provides an apparatus for determining a new word, where the apparatus includes:
the text acquisition module is used for acquiring a sample text set;
the sequence mining module is used for carrying out word sequence mining on the sample text set to obtain frequent word sequences corresponding to all lengths;
a super sequence determining module for determining each super sequence in the frequent word sequences corresponding to each length;
and the new word determining module is used for determining the supersequence as a new word if the supersequence is not contained in each participle contained in the sample text set.
In another aspect, an embodiment of the present application provides a text processing apparatus, including:
the text acquisition module is used for acquiring a text to be processed;
and the word segmentation processing module is used for performing word segmentation processing on the text to be processed based on the word segmentation database to obtain the words included in the text to be processed, wherein the word segmentation database includes the new words determined by the method in the first aspect.
In another aspect, an embodiment of the present application provides an electronic device, which includes a processor and a memory:
the memory is configured to store a computer program which, when executed by the processor, causes the processor to perform the method provided by any aspect of the present application.
In yet another aspect, the embodiments of the present application provide a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer is enabled to execute the method provided in any aspect of the present application.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
in the embodiment of the application, word sequence mining can be performed on the sample text set to obtain frequent word sequences of various lengths, the supersequences among the frequent word sequences are then screened based on the participles contained in the sample text set, and the screened supersequences are taken as new words. By adopting word sequence mining, newly emerging characters, words or phrases can be better screened out, which has important reference value and practical significance for applications such as word segmentation and new word discovery; in addition, in the process of determining the new words, no complex neural network model needs to be trained and no training samples need to be labeled manually, so the training cost is effectively reduced.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a method for determining new words according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a text processing method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an apparatus for determining new words according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Research in this field therefore involves natural language, i.e. the language that people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
The appearance of new words and the mis-splitting of word segmentation are common phenomena in natural language understanding and are difficult problems that existing language processing technology must overcome. Existing methods for discovering and identifying new words mainly comprise new word discovery methods based on a language model, new word discovery algorithms based on segmentation, and new word discovery methods based on deep learning. These methods are briefly described below.
1. The language-model-based new word discovery method: this is a method of calculation using conditional probability; the probability of a word is converted into the joint probability of the word by the Bayesian formula, and a new word is determined based on that joint probability.
2. The segmentation-based new word discovery algorithm: segmentation in this algorithm is based on the degree of cohesion (solidification degree) among elements, and new words are identified by choosing when to segment through calculation of the cohesion among word elements.
3. The deep-learning-based new word discovery method: word segmentation is performed first, the amount of information is then calculated, rule-based recognition is added, and new word discovery is carried out through a deep learning model.
However, the existing methods described above have the following problems to be improved:
1. With the language-model-based new word discovery method, the effect depends on the quality of the language model, and the probability transformation through the Bayesian formula is premised on the assumption that the features are independent; in reality, however, characters are associated with one another to a certain degree and are not completely independent.
2. The segmentation-based new word discovery method needs to select a relatively large cohesion threshold in order to avoid cutting out too many invalid words; in reality, however, the cohesion of a word formed by two characters does not need to be large, that is, the segmentation criterion is not rigorous.
3. The new word discovery method based on a deep learning model needs to train a neural network on a large number of labeled training samples, and for industrial application its efficiency is often difficult to bring up to online requirements.
Based on this, the present application provides a method, an apparatus, an electronic device, and a computer-readable storage medium for determining a new word, which are intended to solve at least one technical problem in the prior art.
The following describes the technical solution of the present application and how to solve at least one of the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
The method for determining new words provided by the embodiment of the application can be applied to any electronic device, such as a smart phone, a tablet computer, and the like. Of course, the method may also be applied to a server (including but not limited to a physical server or a cloud server), and the server may determine new words from the sample text set based on the method provided in the embodiments of the present application.
Fig. 1 shows a flow chart of a method for determining new words provided in an embodiment of the present application. As shown in fig. 1, the method includes:
step S101, a sample text set is obtained.
The sample text set includes a plurality of sample texts, and the plurality of sample texts may include new words, for example, some sample texts include newly generated network terms, some sample texts include newly generated professional terms in a certain field, and the like. In an alternative embodiment of the present application, the specific form of the sample text may be a single sentence or a single word, and may be configured in advance according to the actual application.
In an alternative embodiment of the present application, obtaining a sample text set may include:
acquiring an initial text set, wherein the initial text set comprises initial texts;
respectively carrying out character preprocessing on each initial text to obtain a preprocessing result corresponding to each initial text;
obtaining a sample text set based on the preprocessing result corresponding to each initial text;
wherein the word preprocessing includes at least one of a sentence dividing processing and a specific character deleting processing.
Specifically, sentence segmentation refers to segmenting an article or a text segment with multiple sentences into multiple independent sentences; specific character deletion refers to deleting specific characters from the text, and the specific types of the specific characters may be configured in advance, for example according to the actual application scenario and/or experience, which is not limited in this embodiment of the application. For example, punctuation marks may be set as specific characters; in that case, when a comma exists in the initial text, the comma is deleted from the initial text.
In practical application, the initial text set includes the initial texts, and the specific form of an initial text is not limited in the embodiment of the present application; for example, an initial text may be an article or a text segment with multiple clauses, or a single sentence. That is, the granularity of the initial text is not limited and may be configured according to practical application needs. As an alternative, since a sample text may be a single sentence or a single clause, when an initial text is an article or a text segment, the article or text segment may be subjected to sentence segmentation, and each clause obtained by the processing (i.e., each preprocessing result) is taken as a sample text.
In an example, assume that an initial text in the initial text set is "I love natural language processing. Natural language processing is a core tool for text analysis and mining. Fine-grained sentiment analysis is a kind of sentiment analysis. This paper develops research on key technologies in fine-grained sentiment analysis.", and that the word preprocessing includes sentence segmentation and specific character deletion, with punctuation marks as the specific characters. The initial text can then be subjected to sentence segmentation and specific character deletion, giving a processed initial text that comprises the four independent clauses "I love natural language processing", "Natural language processing is a core tool for text analysis and mining", "Fine-grained sentiment analysis is a kind of sentiment analysis", and "This paper develops research on key technologies in fine-grained sentiment analysis". These four independent clauses are taken as 4 sample texts in the sample text set, as shown in Table 1.
TABLE 1
I love natural language processing (我爱自然语言处理)
Natural language processing is a core tool for text analysis and mining (自然语言处理是文本分析挖掘的核心工具)
Fine-grained sentiment analysis is a kind of sentiment analysis (细粒度情感分析是情感分析的一种)
This paper is directed to key technologies in fine-grained sentiment analysis (本文针对细粒度情感分析中的关键技术展开研究)
It can be understood that, in the embodiment of the present application, when the word preprocessing includes sentence segmentation and specific character deletion, and the specific character deletion is punctuation deletion, sentence segmentation may be performed on the initial text first, followed by punctuation deletion. A minimal sketch of this preprocessing is given below.
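The sketch assumes that the specific characters to delete are punctuation marks and that sentences are split on common sentence-ending punctuation; both the delimiter set and the character set below are illustrative assumptions, not values prescribed by the embodiment.

```python
import re

SENTENCE_DELIMITERS = r"[。！？!?；;，,]"                 # assumed clause boundaries
SPECIFIC_CHARACTERS = set("，。！？!?；;、：:\"'（）()")    # assumed characters to delete

def build_sample_text_set(initial_texts):
    """Split each initial text into clauses and delete the specific
    characters, returning the resulting sample text set."""
    samples = []
    for text in initial_texts:
        for clause in re.split(SENTENCE_DELIMITERS, text):  # sentence segmentation
            cleaned = "".join(c for c in clause if c not in SPECIFIC_CHARACTERS)
            if cleaned.strip():                              # drop empty results
                samples.append(cleaned.strip())
    return samples
```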
Step S102, word sequence mining is performed on the sample text set to obtain frequent word sequences corresponding to each length.
A word element refers to a character included in a frequent word sequence, and characters in different forms are different word elements. A frequent word sequence is a word sequence that appears frequently in the sample text set, and the length refers to the number of word elements contained in the frequent word sequence; for example, when a frequent word sequence is "natural language" (自然语言), it contains the four word elements 自, 然, 语 and 言, and its length is 4. Optionally, each length in this embodiment of the application may refer to every length from one word element up to the number of word elements contained in the longest frequent word sequence, or to every length from a set starting length (for example, one word element or two word elements) up to the number of word elements contained in the longest frequent word sequence, which is not limited in this embodiment of the application.
In practical application, after the sample text set is obtained, word sequence mining can be performed on each sample text included in the sample text set to obtain frequent word sequences corresponding to various lengths.
In step S103, each supersequence in the frequent word sequences corresponding to each length is determined.
In practical applications, if all elements of a certain frequent word sequence A can be found, in order, in frequent word sequence B, then frequent word sequence A is a subsequence of frequent word sequence B. According to this definition, assume a frequent word sequence A = {a1, a2, ..., an} and a frequent word sequence B = {b1, b2, ..., bm} with n ≤ m; if there exists an index sequence 1 ≤ j1 < j2 < ... < jn ≤ m satisfying ak = bjk for each k = 1, 2, ..., n, then frequent word sequence A is said to be a subsequence of frequent word sequence B, and frequent word sequence B is in turn said to be a supersequence of frequent word sequence A.
In practical applications, since the frequent word sequences obtained by mining have different lengths, the elements of some frequent word sequences may all be contained in other frequent word sequences, and it can be determined which of the frequent word sequences are supersequences and which are subsequences. Because a supersequence contains more reference information, the subsequences can be deleted and the supersequences retained, in order to preserve the integrity of the information and reduce the amount of subsequent data processing.
The implementation manner of determining each supersequence in the frequent word sequence may be preconfigured, which is not limited in the embodiment of the present application. For example, the super-sequence in the frequent word sequence can be determined by setting an n-gram window, and the number of word elements (i.e., n) included in the n-gram window can be determined according to the length of each frequent word sequence.
In an example, assume that the longest frequent word sequence among the frequent word sequences of each length has a length of 8. Then n may be set to 8, and a word element is arbitrarily selected as the center W in each frequent word sequence, for example the word element 语. All frequent word sequences that contain the word element 语 and in which the number of word elements before and after it (which may be referred to as the contextual auxiliary information context(W) of W) is less than 8 are then found, and it is determined which of the found frequent word sequences are supersequences and which are subsequences. If it is determined that both the frequent word sequence "natural language processing" and the frequent word sequence "natural language" contain the word element 语, then, since "natural language processing" contains the additional contextual auxiliary information "processing" on top of the subsequence "natural language", the supersequence "natural language processing" can be retained and the subsequence "natural language" deleted.
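The windowing procedure itself is not prescribed by the embodiment; as a minimal sketch under that freedom, the retained supersequences can also be obtained directly from the definition above by dropping every frequent word sequence that is a subsequence of another one (the function names are illustrative):

```python
def is_subsequence(a, b):
    """True if word sequence a occurs in b in the same order
    (the subsequence definition given above)."""
    it = iter(b)
    return all(element in it for element in a)

def keep_supersequences(frequent_sequences):
    """Retain only the supersequences: drop every frequent word sequence
    that is a subsequence of another frequent word sequence."""
    return [seq for seq in frequent_sequences
            if not any(seq != other and is_subsequence(seq, other)
                       for other in frequent_sequences)]

# E.g. ["natural language", "natural language processing"] keeps only
# "natural language processing", matching the example above.
```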
Step S104, for each supersequence, if the supersequence is not contained in the participles contained in the sample text set, the supersequence is determined to be a new word.
In practical application, word segmentation may be performed on the sample texts included in the sample text set to obtain the participles contained in the sample text set. The specific word segmentation implementation may be configured in advance and is not limited in the embodiment of the application; for example, an existing word segmentation tool (such as jieba) may be used to segment the sample texts. Then, for each supersequence, it is determined whether the participles contained in the sample text set include the supersequence; if not, the supersequence may be determined to be a new word, and conversely, if the participles contained in the sample text set include the supersequence, the supersequence is an existing word. In addition, in practical application, whether each supersequence is a new word may also be determined based on the word segmentation result obtained by segmenting the initial texts in the initial text set, which is not limited in the embodiment of the application.
Continuing with the previous example, assume that whether each supersequence is a new word is determined based on the word segmentation result of the initial texts in the initial text set, and that this word segmentation result is as shown in Table 2. Then, for each supersequence, it is determined whether the participles shown in Table 2 include the supersequence; if not, the supersequence is determined to be a new word, and the resulting new words are shown in Table 3. For example, the supersequence "natural language processing" does not appear in Table 2, so "natural language processing" is a new word.
TABLE 2
Word segmentation result
I love natural language processing
Natural language processing is a core tool for text analysis and mining
Fine-grained sentiment analysis is one kind of sentiment analysis
The present document develops research on key technologies in fine-grained sentiment analysis
TABLE 3
New word results
Fine-grained sentiment analysis
Natural language processing
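The following sketch illustrates the check of step S104, assuming the participles are collected with jieba, which the description names as one possible word segmentation tool; the function name is illustrative and the sketch is not a definitive implementation of the embodiment.

```python
import jieba  # one possible word segmentation tool, as mentioned above

def find_new_words(supersequences, texts):
    """Report every supersequence that does not occur among the participles
    obtained by segmenting the given sample (or initial) texts."""
    known_words = set()
    for text in texts:
        known_words.update(jieba.lcut(text))
    return [seq for seq in supersequences if seq not in known_words]
```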
In the embodiment of the application, word sequence mining can be performed on the sample text set to obtain frequent word sequences of various lengths, the supersequences among the frequent word sequences are then screened based on the participles contained in the sample text set, and the screened supersequences are taken as new words, so that newly emerging characters, words or phrases can be better screened out; the method therefore has important reference value and practical significance in applications such as word segmentation and new word discovery. In addition, in the process of determining the new words, no complex neural network model needs to be trained and no training samples need to be labeled manually, so the training cost is effectively reduced.
In an optional embodiment of the present application, word sequence mining is performed on a sample text set to obtain frequent word sequences corresponding to each length, including:
determining the number of samples respectively corresponding to each word element included in the sample text set, wherein, for a word element, the number of samples corresponding to the word element refers to the number of sample texts in the sample text set that contain the word element;
filtering the word elements included in the sample text set based on the sample number corresponding to each word element to obtain a processed sample text set;
and carrying out word sequence mining on the processed sample text set to obtain frequent word sequences corresponding to all lengths.
In practical applications, the number of samples corresponding to each word element included in the sample text set may be counted. It should be noted that, for a word element, when the word element appears multiple times in the same sample, that sample is still counted only once, that is, the number of samples is increased by 1.
In an example, assume that the sample texts included in the sample text set are the four independent clauses "I love natural language processing", "Natural language processing is a core tool for text analysis and mining", "Fine-grained sentiment analysis is a kind of sentiment analysis", and "This paper develops research on key technologies in fine-grained sentiment analysis". The number of samples corresponding to each word element can then be obtained by counting. For example, the word element 分 appears in "Natural language processing is a core tool for text analysis and mining", "Fine-grained sentiment analysis is a kind of sentiment analysis" and "This paper develops research on key technologies in fine-grained sentiment analysis", so the number of samples corresponding to the word element 分 is 3; the word element 我 appears only in "I love natural language processing", so the number of samples corresponding to the word element 我 is 1. Similarly, the numbers of samples corresponding to the other word elements can be obtained, such as 3 for the word element 析 and 3 for the word element 的, which are not described again here.
Further, the word elements included in the sample text set may be filtered based on the sample number corresponding to each word element to obtain a processed sample text set, and then word sequence mining may be performed on the processed sample text set to obtain frequent word sequences corresponding to each length.
In an optional embodiment of the present application, filtering word elements included in the sample text set based on the number of samples corresponding to each word element to obtain a processed sample text set includes:
for the sample number of a word element, if the sample number meets the set condition, deleting the word element from the sample text set;
the number of samples satisfying the set condition includes at least one of:
the number of samples is less than a set value or the ratio of the number of samples is less than a preset value;
the proportion of the number of samples is the ratio of the number of samples corresponding to the word element to the number of sample texts included in the sample text set.
In practical application, for a word element, it may be determined whether the number of samples corresponding to the word element satisfies the set condition; if so, the word element may be deleted from every sample text in which it appears. The set condition may include at least one of the following: the number of samples is smaller than a set value, or the proportion of the number of samples is smaller than a preset value, where the proportion of the number of samples is the ratio of the number of samples corresponding to the word element to the number of sample texts included in the sample text set. For example, if the number of samples corresponding to a word element is 4 and the sample text set includes 4 sample texts, the proportion of the number of samples of that word element is 4/4 = 1.
Continuing the above example, assume that the set condition is that the number of samples is smaller than a set value, and that the set value is 2. Since the number of samples corresponding to the word element 分 is 3, which is not smaller than the set value 2, the word element 分 is retained; the number of samples corresponding to the word element 我 is 1, which is smaller than the set value 2, so the word element 我 is deleted. Similarly, it can be determined whether the other word elements, such as the word element 析 and the word element 感, need to be deleted; the filtered word elements are shown in Table 4.
TABLE 4
Word element  Number of samples
分  3
析  3
的  3
感  2
情  2
本  2
处  2
度  2
理  2
粒  2
然  2
是  2
文  2
细  2
言  2
语  2
自  2
Further, each sample text in this example retains only the word elements listed in the "Word element" column of Table 4, so that a processed sample text set is obtained, as shown in Table 5. For example, the word element 我 meets the set condition and can therefore be deleted from the sample text "I love natural language processing" (我爱自然语言处理); after the word element 爱, which also meets the set condition, is deleted as well, the processed sample 自然语言处理 ("natural language processing") is obtained.
TABLE 5
Processed sample text
自然语言处理 (natural language processing)
自然语言处理是文本分析的 (natural language processing is text analysis)
细粒度情感分析是情感分析的 (fine-grained sentiment analysis is sentiment analysis)
本文细粒度情感分析的 (fine-grained sentiment analysis of this paper)
It is to be understood that, when the set condition is that the proportion of the number of samples is smaller than a preset value, the proportion of the number of samples corresponding to each word element may be determined, and the word elements whose proportions are smaller than the preset value may then be deleted from the sample texts. For example, in the above example, the number of samples corresponding to the word element 我 is 1, so its proportion is 1/4, which is smaller than the preset value 1/3, and the word element 我 may be deleted from each sample text; the number of samples corresponding to the word element 感 is 2, so its proportion is 2/4, which is not smaller than the preset value 1/3, and the word element 感 may be retained.
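A minimal sketch of this filtering step follows, assuming the set condition is that the number of samples is smaller than a set value (2, as in the example above); the ratio-based condition can be expressed the same way by comparing the count divided by the number of sample texts with the preset value.

```python
def filter_word_elements(sample_texts, set_value=2):
    """Count, for every word element, the number of sample texts containing
    it, and delete from each sample text the elements whose count is smaller
    than set_value."""
    sample_counts = {}
    for text in sample_texts:
        for element in set(text):            # one sample counts only once
            sample_counts[element] = sample_counts.get(element, 0) + 1

    kept = {e for e, n in sample_counts.items() if n >= set_value}
    return ["".join(e for e in text if e in kept) for text in sample_texts]
```

Applied to the original-language clauses of Table 1 with set_value=2, such a sketch would yield the processed sample texts shown in Table 5.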
In an optional embodiment of the present application, word sequence mining is performed on a sample text set to obtain frequent word sequences corresponding to each length, including:
based on the PrefixSpan (Prefix-Projected Pattern Growth) algorithm, word sequence mining is performed on the sample text set to obtain frequent word sequences corresponding to each length.
In practical application, a minimum support threshold can be preset, and word sequence mining is then performed on the sample text set with the PrefixSpan algorithm to obtain frequent word sequences corresponding to each length. The minimum support is calculated as follows:
min_sup = a × n
where n is the number of samples, a is the minimum support rate, which can be adjusted according to the size of the sample text set, and min_sup is the minimum support.
The specific steps of word sequence mining based on the PrefixSpan algorithm are as follows:
1. Find every word sequence prefix of length 1 and its corresponding projected data set.
2. Count the occurrences of the length-1 prefixes, add the prefixes whose support is higher than the minimum support threshold to the data set, and obtain the set of length-1 frequent word sequences.
3. Recursively mine all prefixes of length i that meet the minimum support threshold:
4. Mine the projected data set of the prefix; if the projected data set is empty, return from the recursion.
5. Count the support of each item in the corresponding projected data set, merge each single item that meets the support requirement with the current prefix to obtain a new prefix, and return from the recursion if the support requirement is not met.
6. Let i = i + 1, take the new prefixes obtained by merging the single items as the prefixes, and recursively execute step 3 until the projected data sets of all prefixes fall below the minimum support.
7. Return all frequent word sequences in the word sequence data set.
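The steps above can be read in more than one way (for example, whether a prefix may be extended by any word element in its projected suffix or only by the immediately adjacent one). The Python sketch below is a simplified, non-authoritative reading that grows a prefix by any sufficiently supported word element of its projected data set; the function name, the default minimum support rate of 1/3 and the use of single characters as items are assumptions made for illustration.

```python
def prefixspan(sample_texts, min_support_rate=1/3):
    """Simplified prefix-projected word sequence mining over character
    sequences, following the steps above. Returns a dict mapping each
    frequent word sequence to its support count."""
    min_sup = min_support_rate * len(sample_texts)  # min_sup = a x n

    def project(database, item):
        # Projected data set: the suffix after the first occurrence of
        # `item` in every sequence of the database that contains it.
        projected = []
        for seq in database:
            idx = seq.find(item)
            if idx != -1:
                projected.append(seq[idx + 1:])
        return projected

    frequent = {}

    def mine(prefix, database):
        # Count each word element once per sequence of the projected data set.
        counts = {}
        for seq in database:
            for element in set(seq):
                counts[element] = counts.get(element, 0) + 1
        for element, count in counts.items():
            if count >= min_sup:
                new_prefix = prefix + element   # merge single item with prefix
                frequent[new_prefix] = count
                mine(new_prefix, project(database, element))

    mine("", sample_texts)
    return frequent
```

Applied to the 4 processed sample texts of Table 5 with a minimum support rate of 1/3, such a sketch would, for instance, report 分析 and longer prefixes built from it as frequent word sequences.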
In an example, assume the sample text set includes the 4 sample texts shown in Table 5. When the PrefixSpan algorithm is used to mine the sample text set for frequent word sequences of each length, prefixes of length 1 (i.e., single-item prefixes) may be mined first, and the prefixes satisfying the minimum support threshold together with their corresponding adjacent suffixes (i.e., the word element that immediately follows the prefix in a sample text) may be determined. For example, for the prefix 分, the adjacent suffix in each of the sample texts 自然语言处理是文本分析的, 细粒度情感分析是情感分析的 and 本文细粒度情感分析的 is 析; for a prefix, sample texts in which it does not occur are indicated by "none" in the table. The adjacent suffixes corresponding to the other prefixes can be obtained in the same way, as shown in Table 6:
TABLE 6
(Table 6 is reproduced as an image in the original publication; it lists the single-item prefixes and their corresponding adjacent suffixes.)
Further, prefixes of length 2 (i.e., two-item prefixes) are mined, and each two-item prefix satisfying the minimum support threshold and its corresponding adjacent suffixes can be determined. For example, for the prefix 分, it may be determined whether the ratio of the number of samples corresponding to each word element in its adjacent suffixes to the total number of samples (the total number of adjacent suffixes of the prefix) is greater than the minimum support threshold 1/3. For the word element 析, the ratio of its number of samples (3) to the total number of samples (3) is 1, which is greater than the minimum support threshold 1/3, so this word element and the prefix may be merged into the two-item prefix 分析, and the adjacent suffixes corresponding to the two-item prefix 分析 are determined. Similarly, the other two-item prefixes and their corresponding adjacent suffixes can be obtained, as shown in Table 7:
TABLE 7
(Table 7 is reproduced as an image in the original publication; it lists the two-item prefixes and their corresponding adjacent suffixes.)
Further, prefixes of length 3 (i.e., three-item prefixes) are mined, and each three-item prefix satisfying the minimum support threshold and its corresponding adjacent suffixes can be determined. For example, for the two-item prefix 感分, it may be determined whether the ratio of the number of samples corresponding to each word element in its adjacent suffixes to the total number of samples is greater than the minimum support threshold 1/3. For the word element 析, the ratio of its number of samples (2) to the total number of samples (2) is 1, which is greater than the minimum support threshold 1/3, so this word element and the two-item prefix may be merged into the three-item prefix 感分析, and the adjacent suffixes corresponding to the three-item prefix 感分析 are determined. Similarly, the other three-item prefixes and their corresponding adjacent suffixes can be obtained, as shown in Table 8:
TABLE 8
(Table 8 is reproduced as an image in the original publication; it lists the three-item prefixes and their corresponding adjacent suffixes.)
Further, on the same principle, four-item prefixes and their corresponding adjacent suffixes are mined. For example, for the three-item prefix 自然语, it may be determined whether the ratio of the number of samples corresponding to each word element in its adjacent suffixes to the total number of samples is greater than the minimum support threshold 1/3; for the word element 言, the ratio of its number of samples (2) to the total number of samples (2) is 1, which is greater than the minimum support threshold 1/3, so this word element and the three-item prefix may be merged into the four-item prefix 自然语言, and the adjacent suffixes corresponding to the four-item prefix 自然语言 are determined. Similarly, the other four-item prefixes and their corresponding adjacent suffixes can be obtained, with the results shown in Table 9:
TABLE 9
(Table 9 is reproduced as an image in the original publication; it lists the four-item prefixes and their corresponding adjacent suffixes.)
Further, on the same principle, five-item prefixes and their corresponding adjacent suffixes are obtained by mining, with the results shown in Table 10:
TABLE 10
(Table 10 is reproduced as an image in the original publication; it lists the five-item prefixes and their corresponding adjacent suffixes.)
Further, on the same principle, six-item prefixes and their corresponding adjacent suffixes are obtained by mining, with the results shown in Table 11:
TABLE 11
(Table 11 is reproduced as an image in the original publication; it lists the six-item prefixes and their corresponding adjacent suffixes.)
Further, on the same principle, seven-item prefixes and their corresponding adjacent suffixes are obtained by mining, with the results shown in Table 12:
TABLE 12
(Table 12 is reproduced as an image in the original publication; it lists the seven-item prefixes and their corresponding adjacent suffixes.)
Further, since none of the word elements in the adjacent suffixes corresponding to the seven-item prefixes meets the minimum support threshold, no eight-item prefix exists; the mining iteration therefore ends, and each prefix is taken as a frequent word sequence of the corresponding length.
In an alternative embodiment of the present application, determining each supersequence in the frequent word sequences corresponding to each length comprises:
performing auxiliary word filtering on each frequent word sequence to obtain each filtered frequent word sequence;
and determining each supersequence in each filtered frequent word sequence.
In practical applications, after the frequent word sequences are obtained, some of them may contain auxiliary words; in order to reduce the amount of data processing, auxiliary word filtering may be performed on each frequent word sequence to obtain the filtered frequent word sequences, and each supersequence among the filtered frequent word sequences is then determined. Common auxiliary words include particles such as 的, 了 and 是; the auxiliary word filtering of the frequent word sequences may be performed by constructing an auxiliary word lexicon, by filtering based on syntactic analysis, or the like.
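The description leaves the filtering mechanism open. One minimal sketch uses a small auxiliary-word lexicon and removes those word elements from each frequent word sequence; the lexicon contents below are illustrative assumptions, and discarding whole sequences or filtering based on syntactic analysis would be alternatives.

```python
# Illustrative auxiliary-word lexicon (particles such as 的, 了, 是); the
# actual contents would come from a constructed lexicon or syntactic analysis.
AUXILIARY_WORDS = {"的", "了", "是", "在", "和"}

def filter_auxiliary_words(frequent_sequences):
    """Remove auxiliary word elements from each frequent word sequence and
    drop any sequence that becomes empty."""
    filtered = []
    for seq in frequent_sequences:
        cleaned = "".join(e for e in seq if e not in AUXILIARY_WORDS)
        if cleaned:
            filtered.append(cleaned)
    return filtered
```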
Fig. 2 is a flowchart illustrating a text processing method provided in an embodiment of the present application. As shown in fig. 2, the method includes:
step S201, acquiring a text to be processed;
step S202, performing word segmentation processing on the text to be processed based on the word segmentation database to obtain the words included in the text to be processed; wherein the word segmentation database comprises the new words determined by the method described above.
The text to be processed refers to a text that needs to be participled, and may be a single sentence, an article with multiple clauses, or a text fragment.
In practical application, after the new words are recognized, they can be added to the lexicon of an existing word segmentation tool; for example, a new word lexicon can be established separately in the existing word segmentation tool, or the new words can be added directly to its original lexicon. Further, when the text to be processed is obtained and the word segmentation tool segments it, if the tool includes both a new word lexicon and an original lexicon, the text to be processed can first be segmented based on the new word lexicon to obtain the new words it includes, and the remaining text (with the new words removed) can then be segmented based on the original lexicon to obtain the final word segmentation result. Correspondingly, if the original lexicon in the word segmentation tool includes the new words, it can be determined whether the text to be processed includes the longest word (in number of word elements) in the original lexicon; if so, that word in the text to be processed is taken as a participle, and it can then be determined whether the remaining text (with that participle removed) includes the next-longest word in the original lexicon; if so, that next-longest word is taken as a participle, and so on, until all the participles included in the text to be processed are obtained.
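As a concrete illustration with the jieba tool mentioned earlier, the discovered new words can be registered before segmenting the text to be processed; the words and the file name below are illustrative, not prescribed by the embodiment.

```python
import jieba

# Register discovered new words in the segmentation lexicon; alternatively,
# jieba.load_userdict("new_words.txt") can load a separate new-word lexicon
# (the file name is illustrative).
for new_word in ["自然语言处理", "细粒度情感分析"]:
    jieba.add_word(new_word)

print(jieba.lcut("本文针对细粒度情感分析中的关键技术展开研究"))
```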
It should be understood that, in this example, the method provided in the embodiment of the present application is merely exemplified by an application scenario of word segmentation, but the method provided in the embodiment of the present application includes, but is not limited to, an application scenario of word segmentation, and may also be applied to other application scenarios, such as new word discovery, syntactic analysis, keyword extraction, and the like.
In summary, the method provided by the embodiment of the application discovers new words based on a sequential pattern mining method and sets matching rules for the word segmentation lexicon, so that professional terms, organization names and the like can be better identified, the mis-splitting problems of existing word segmentation tools are reduced, and the accuracy of subsequent tasks can be greatly improved.
The embodiment of the present application provides an apparatus for determining a new word, and as shown in fig. 3, the apparatus 60 for determining a new word may include: a text acquisition module 601, a sequence mining module 602, a supersequence determination module 603, and a new word determination module 604, wherein,
a text obtaining module 601, configured to obtain a sample text set;
a sequence mining module 602, configured to perform word sequence mining on the sample text set to obtain frequent word sequences corresponding to each length;
a supersequence determining module 603 configured to determine supersequences in the frequent word sequences corresponding to the respective lengths;
and a new word determining module 604, configured to, for each supersequence, determine the supersequence as a new word if the supersequence is not included in the participles included in the sample text set.
Optionally, when performing word sequence mining on the sample text set to obtain frequent word sequences corresponding to each length, the sequence mining module is specifically configured to:
determining the number of samples respectively corresponding to each word element included in the sample text set, wherein, for a word element, the number of samples corresponding to the word element refers to the number of sample texts in the sample text set that contain the word element;
filtering the word elements included in the sample text set based on the sample number corresponding to each word element to obtain a processed sample text set;
and carrying out word sequence mining on the processed sample text set to obtain frequent word sequences corresponding to all lengths.
Optionally, the sequence mining module, when filtering the word elements included in the sample text set based on the number of samples corresponding to each word element to obtain the processed sample text set, is specifically configured to:
for the sample number of a word element, if the sample number meets the set condition, deleting the word element from the sample text set;
the number of samples satisfying the set condition includes at least one of:
the number of samples is less than a set value or the ratio of the number of samples is less than a preset value;
the proportion of the number of samples is the ratio of the number of samples corresponding to the word element to the number of sample texts included in the sample text set.
Optionally, when performing word sequence mining on the sample text set to obtain frequent word sequences corresponding to each length, the sequence mining module is specifically configured to:
and performing word sequence mining on the sample text set based on the PrefixSpan algorithm to obtain frequent word sequences corresponding to each length.
Optionally, when determining each supersequence in the frequent word sequences corresponding to each length, the supersequence determining module is specifically configured to:
performing auxiliary word filtering on each frequent word sequence to obtain each filtered frequent word sequence;
and determining each supersequence in each filtered frequent word sequence.
Optionally, when the text acquisition module acquires the sample text set, the method is specifically configured to:
acquiring an initial text set, wherein the initial text set comprises initial texts;
respectively carrying out character preprocessing on each initial text to obtain a preprocessing result corresponding to each initial text;
obtaining a sample text set based on the preprocessing result corresponding to each initial text;
wherein the word preprocessing includes at least one of a sentence dividing processing and a specific character deleting processing.
The apparatus for determining a new word in the embodiment of the present application can execute the method for determining a new word provided in the embodiment of the present application, and the implementation principles are similar, and are not described herein again.
An embodiment of the present application provides a text processing apparatus, and as shown in fig. 4, the text processing apparatus 70 may include: a text acquisition module 701 and a word segmentation processing module 702, wherein,
a text obtaining module 701, configured to obtain a text to be processed;
a word segmentation processing module 702, configured to perform word segmentation processing on the text to be processed based on the word segmentation database, so as to obtain words included in the text to be processed; the word segmentation database comprises new words determined by the method in the embodiment.
The text processing apparatus according to the embodiment of the present application can execute the text processing method according to the embodiment of the present application, and the implementation principles thereof are similar and will not be described herein again.
An embodiment of the present application provides an electronic device, as shown in fig. 5, an electronic device 2000 shown in fig. 5 includes: a processor 2001 and a memory 2003. Wherein the processor 2001 is coupled to a memory 2003, such as via a bus 2002. Optionally, the electronic device 2000 may also include a transceiver 2004. It should be noted that the transceiver 2004 is not limited to one in practical applications, and the structure of the electronic device 2000 is not limited to the embodiment of the present application.
The processor 2001 is applied in the embodiment of the present application to implement the functions of the modules shown in fig. 3 and 4.
The processor 2001 may be a CPU, a general purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure. The processor 2001 may also be a combination implementing computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor, and the like.
Bus 2002 may include a path that conveys information between the aforementioned components. The bus 2002 may be a PCI bus or an EISA bus, etc. The bus 2002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus.
The memory 2003 may be, but is not limited to, a ROM or another type of static storage device capable of storing static information and computer programs, a RAM or another type of dynamic storage device capable of storing information and computer programs, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, and the like), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store a desired computer program in the form of instructions or data structures and that can be accessed by a computer.
The memory 2003 is used for storing computer programs for executing the application programs of the present scheme and is controlled in execution by the processor 2001. The processor 2001 is used to execute a computer program of an application program stored in the memory 2003 to realize the actions of the apparatus in the embodiment shown in fig. 3 or fig. 4.
An embodiment of the present application provides an electronic device, including a processor and a memory: the memory is configured to store a computer program which, when executed by the processor, causes the processor to perform any of the methods of the above embodiments.
The present application provides a computer-readable storage medium for storing a computer program, which, when executed on a computer, enables the computer to perform any one of the above-mentioned methods.
The terms and implementation principles used in this application for a computer-readable storage medium may refer to the method in the embodiments of the present application, and are not described herein again.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts of the figures may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and whose execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims (10)

1. A method of determining new words, comprising:
acquiring a sample text set;
performing word sequence mining on the sample text set to obtain frequent word sequences corresponding to all lengths;
determining each super sequence in the frequent word sequences corresponding to each length;
for each supersequence, determining the supersequence as a new word if the supersequence is not included in the participles included in the sample text set.
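By way of illustration only, the following Python sketch traces the flow of claim 1 under simplifying assumptions: word elements are single characters, frequent word sequences are approximated by contiguous character n-grams occurring in at least min_support sample texts (a full sequential-pattern miner such as the one recited in claim 4 could be substituted), a supersequence is a frequent sequence not contained in any longer frequent sequence, and known_vocabulary stands in for the participles of the sample text set. All names and thresholds are illustrative and not part of the claimed method.

from collections import Counter

def mine_frequent_ngrams(texts, max_len=4, min_support=2):
    # Count contiguous character n-grams of length 2..max_len and keep those
    # appearing in at least min_support sample texts (document frequency).
    frequent = {}
    for n in range(2, max_len + 1):
        doc_freq = Counter()
        for text in texts:
            doc_freq.update({text[i:i + n] for i in range(len(text) - n + 1)})
        frequent.update({g: c for g, c in doc_freq.items() if c >= min_support})
    return frequent

def maximal_sequences(frequent):
    # "Super sequences": frequent sequences not contained in any longer frequent sequence.
    grams = list(frequent)
    return [g for g in grams if not any(g != other and g in other for other in grams)]

def new_words(texts, known_vocabulary, max_len=4, min_support=2):
    candidates = maximal_sequences(mine_frequent_ngrams(texts, max_len, min_support))
    return [w for w in candidates if w not in known_vocabulary]

# Example: '深度学习' and '模型训练' are frequent, maximal, and absent from the vocabulary.
samples = ["深度学习模型训练", "深度学习平台上线", "模型训练加速"]
print(new_words(samples, {"模型", "训练", "平台", "上线", "加速"}))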
2. The method of claim 1, wherein performing word sequence mining on the sample text set to obtain frequent word sequences corresponding to respective lengths comprises:
determining the number of samples corresponding to each word element included in the sample text set, wherein for a word element, the number of samples corresponding to the word element refers to the number of sample texts including the word element in the sample text set;
filtering the word elements included in the sample text set based on the sample number corresponding to each word element to obtain a processed sample text set;
and carrying out word sequence mining on the processed sample text set to obtain frequent word sequences corresponding to all lengths.
3. The method of claim 2, wherein the filtering the word elements included in the sample text set based on the number of samples corresponding to each word element to obtain a processed sample text set comprises:
for each word element, if the number of samples corresponding to the word element meets a set condition, deleting the word element from the sample text set;
wherein the number of samples meeting the set condition includes at least one of:
the number of samples being less than a set value, or the proportion of the number of samples being less than a preset value;
wherein the proportion of the number of samples is the ratio of the number of samples corresponding to the word element to the number of sample texts included in the sample text set.
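A minimal sketch of the filtering step of claims 2 and 3, assuming that a word element is a single character and that the thresholds min_count and min_ratio are illustrative; an element is deleted when its number of samples (document frequency) is below the set value or its proportion is below the preset value.

from collections import Counter

def filter_rare_elements(sample_texts, min_count=2, min_ratio=0.01):
    # Document frequency: each element is counted at most once per sample text.
    total = len(sample_texts)
    doc_freq = Counter()
    for text in sample_texts:
        doc_freq.update(set(text))
    keep = {el for el, n in doc_freq.items() if n >= min_count and n / total >= min_ratio}
    # Delete filtered elements from every sample text to obtain the processed text set.
    return ["".join(el for el in text if el in keep) for text in sample_texts]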
4. The method of claim 1, wherein performing word sequence mining on the sample text set to obtain frequent word sequences corresponding to respective lengths comprises:
performing word sequence mining on the sample text set based on the PrefixSpan (prefix-projected pattern mining) algorithm to obtain the frequent word sequences corresponding to respective lengths.
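The claim names the PrefixSpan algorithm; the following is a minimal, illustrative PrefixSpan over character sequences, not the patent's implementation. It grows frequent prefixes and recurses on projected databases, counting support as the number of sample sequences whose projection contains the candidate item; gap constraints, which a practical new-word miner would likely add, are omitted for brevity.

def prefixspan(sequences, min_support):
    # Returns (pattern, support) pairs for every frequent sequential pattern.
    results = []

    def project(db, item):
        projected = []
        for seq in db:
            if item in seq:
                projected.append(seq[seq.index(item) + 1:])
        return projected

    def grow(prefix, db):
        counts = {}
        for seq in db:
            for item in set(seq):            # count each item once per sequence
                counts[item] = counts.get(item, 0) + 1
        for item, count in counts.items():
            if count >= min_support:
                pattern = prefix + [item]
                results.append((pattern, count))
                grow(pattern, project(db, item))

    grow([], [list(s) for s in sequences])
    return results

# Example: frequent sequential patterns mined from three short character sequences.
print(prefixspan(["深度学习", "深度学平台", "学习模型"], min_support=2))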
5. The method of claim 1, wherein determining each supersequence in the frequent word sequences corresponding to each length comprises:
performing auxiliary word filtering on each frequent word sequence to obtain each filtered frequent word sequence;
and determining each supersequence in each filtered frequent word sequence.
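One possible reading of claim 5, sketched below with an illustrative auxiliary-word list: auxiliary characters are stripped from each frequent word sequence, and only the supersequences (sequences not contained in any other remaining sequence) are kept. The list and the minimum-length rule are assumptions for the example, not part of the claim.

AUXILIARY_WORDS = {"的", "了", "着", "吗", "呢"}   # illustrative, not an exhaustive list

def filter_auxiliary(frequent_sequences):
    # Drop auxiliary characters; discard sequences that shrink below two elements.
    cleaned = {"".join(ch for ch in seq if ch not in AUXILIARY_WORDS) for seq in frequent_sequences}
    return {seq for seq in cleaned if len(seq) >= 2}

def supersequences(frequent_sequences):
    # Keep sequences not contained in any other (longer) frequent sequence.
    return {s for s in frequent_sequences if not any(s != t and s in t for t in frequent_sequences)}

print(supersequences(filter_auxiliary({"深度学习的", "深度学习", "模型训练", "模型"})))
# -> {'深度学习', '模型训练'}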
6. The method of claim 1, wherein obtaining the sample text set comprises:
acquiring an initial text set, wherein the initial text set comprises initial texts;
performing text preprocessing on each initial text respectively to obtain a preprocessing result corresponding to each initial text;
obtaining the sample text set based on the preprocessing result corresponding to each initial text;
wherein the text preprocessing includes at least one of sentence splitting processing and specific character deletion processing.
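A minimal sketch of the preprocessing of claim 6, assuming sentence splitting on common Chinese and Latin sentence-ending punctuation and deletion of every character that is not a Chinese character, letter, or digit; the punctuation set and character ranges are illustrative choices, not prescribed by the claim.

import re

def preprocess(initial_texts):
    # Split each initial text into sentences, delete specific characters,
    # and collect the non-empty results as the sample text set.
    sample_texts = []
    for text in initial_texts:
        for sentence in re.split(r"[。！？；!?;\n]+", text):
            cleaned = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9]", "", sentence)
            if cleaned:
                sample_texts.append(cleaned)
    return sample_texts

print(preprocess(["深度学习很热门！模型训练需要数据。", "新词发现：一种方法？"]))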
7. A method of text processing, the method comprising:
acquiring a text to be processed;
performing word segmentation on the text to be processed based on a word segmentation database to obtain words included in the text to be processed, wherein the word segmentation database includes new words determined by the method of any one of claims 1 to 6.
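By way of example, one common way to realize claim 7 in Python is to register the discovered new words in the segmentation lexicon before segmenting; the sketch below uses the jieba segmenter, which is only one possible choice and is not prescribed by the claim.

import jieba  # third-party segmenter, one possible implementation choice

def segment_with_new_words(text, new_words):
    # Add each discovered new word to the lexicon, then segment the text to be processed.
    for word in new_words:
        jieba.add_word(word)
    return jieba.lcut(text)

# With '深度学习' added, the segmenter will typically keep it as a single token,
# e.g. ['深度学习', '模型', '训练'] (exact output depends on the base dictionary).
print(segment_with_new_words("深度学习模型训练", ["深度学习"]))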
8. An apparatus for determining new words, comprising:
the text acquisition module is used for acquiring a sample text set;
the sequence mining module is used for carrying out word sequence mining on the sample text set to obtain frequent word sequences corresponding to all lengths;
a super sequence determining module, configured to determine each super sequence in the frequent word sequences corresponding to each length;
and the new word determining module is used for, for each supersequence, determining the supersequence as a new word if the supersequence is not included in the participles included in the sample text set.
9. A text processing apparatus, comprising:
the text acquisition module is used for acquiring a text to be processed;
a word segmentation processing module, configured to perform word segmentation processing on the text to be processed based on a word segmentation database to obtain words included in the text to be processed, where the word segmentation database includes new words determined by the method of any one of claims 1 to 6.
10. An electronic device, comprising a processor and a memory:
the memory is configured to store a computer program which, when executed by the processor, causes the processor to perform the method of any one of claims 1-7.
CN202010525541.1A 2020-06-10 2020-06-10 Method and device for determining new words, electronic equipment and readable storage medium Pending CN111680146A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010525541.1A CN111680146A (en) 2020-06-10 2020-06-10 Method and device for determining new words, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010525541.1A CN111680146A (en) 2020-06-10 2020-06-10 Method and device for determining new words, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN111680146A true CN111680146A (en) 2020-09-18

Family

ID=72454421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010525541.1A Pending CN111680146A (en) 2020-06-10 2020-06-10 Method and device for determining new words, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111680146A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417093A (en) * 2020-11-11 2021-02-26 北京三快在线科技有限公司 Model training method and device
CN112417093B (en) * 2020-11-11 2024-03-08 北京三快在线科技有限公司 Model training method and device
CN113010642A (en) * 2021-03-17 2021-06-22 腾讯科技(深圳)有限公司 Semantic relation recognition method and device, electronic equipment and readable storage medium
CN113010642B (en) * 2021-03-17 2023-12-15 腾讯科技(深圳)有限公司 Semantic relation recognition method and device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN105095204B (en) The acquisition methods and device of synonym
Yaghoobzadeh et al. Multi-level representations for fine-grained typing of knowledge base entities
CN112395385B (en) Text generation method and device based on artificial intelligence, computer equipment and medium
CN111222305A (en) Information structuring method and device
CN111651986B (en) Event keyword extraction method, device, equipment and medium
US11170169B2 (en) System and method for language-independent contextual embedding
CN111291177A (en) Information processing method and device and computer storage medium
CN111325030A (en) Text label construction method and device, computer equipment and storage medium
CN106980620A (en) A kind of method and device matched to Chinese character string
Abujar et al. An approach for bengali text summarization using word2vector
CN111680146A (en) Method and device for determining new words, electronic equipment and readable storage medium
CN112711666B (en) Futures label extraction method and device
JP6867963B2 (en) Summary Evaluation device, method, program, and storage medium
CN104021202A (en) Device and method for processing entries of knowledge sharing platform
Kadim et al. Parallel HMM-based approach for arabic part of speech tagging.
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN116644148A (en) Keyword recognition method and device, electronic equipment and storage medium
CN111507098B (en) Ambiguous word recognition method and device, electronic equipment and computer-readable storage medium
CN114676699A (en) Entity emotion analysis method and device, computer equipment and storage medium
CN114491076A (en) Data enhancement method, device, equipment and medium based on domain knowledge graph
CN113010642A (en) Semantic relation recognition method and device, electronic equipment and readable storage medium
WO2021221535A1 (en) System and method for augmenting a training set for machine learning algorithms
CN110765239B (en) Hot word recognition method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination