CN115858771A - Word searching method and device and computer readable storage medium - Google Patents
Word searching method and device and computer readable storage medium
- Publication number
- CN115858771A CN115858771A CN202210027820.4A CN202210027820A CN115858771A CN 115858771 A CN115858771 A CN 115858771A CN 202210027820 A CN202210027820 A CN 202210027820A CN 115858771 A CN115858771 A CN 115858771A
- Authority
- CN
- China
- Prior art keywords
- word
- corpus
- determining
- words
- frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a word searching method, a word searching device and a computer-readable storage medium. The method comprises the following steps: obtaining a corpus and classifying the sentences in the corpus to obtain a plurality of sets, wherein each set comprises a plurality of sentences and all sentences in a set belong to the same service; performing word segmentation on the sentences in each set to obtain the first words corresponding to each set; determining, according to the set to which each first word belongs, the pointwise mutual information (PMI) value and the left-right cross entropy corresponding to each first word; and determining that a first word is a target word when its PMI value is greater than a first threshold and its left-right cross entropy is greater than a second threshold, the target word being a domain word that has not yet been stored. The invention improves the accuracy of new word discovery.
Description
Technical Field
The present invention relates to the technical field of word segmentation, and in particular to a word searching method and apparatus and a computer-readable storage medium.
Background
Currently, NLP (Natural Language Processing) mainly comprises upstream tasks and downstream tasks. Upstream tasks are mainly concerned with feature engineering, while downstream tasks are specific applications for particular scenarios, such as text classification, sequence recognition, sequence generation, NLU (Natural Language Understanding), and the like. Feature engineering maps structured, unstructured and semi-structured text into mathematical vector-space representations at the character, word and semantic levels. The quality of this mathematical vector representation is therefore crucial to the quality of the subsequent tasks. At present, text is first segmented and then encoded, which allows a deeper distributed, compressed representation. In terms of segmentation, whether characters, words or N-Grams (a segmentation approach based on a statistical language model) are used, information loss or information redundancy in a specific business field can result. For example, a pre-trained model builds its dictionary (a dictionary composed of a number of tokens) on a character basis, so word-level meaning and semantics are lost in the initial task. Therefore, new words need to be extracted from the text as domain words to avoid loss of business-domain information or information redundancy.
At present, a supervised named-entity mining algorithm can be used to find new words, but the text must be labeled manually in advance and the entities in the text must be annotated. The present scheme instead mines new domain words with a statistical method and then audits them. PMI (pointwise mutual information) and left-right cross entropy in statistical methods can be used to mine new words in a field. PMI mainly measures the degree of cohesion of a word: the higher the co-occurrence frequency of two tokens, the more likely they form a new word. Left-right cross entropy mainly measures the boundary of a word: the larger the left-right cross entropy, the richer the collocations on either side of the word, and the more likely the word is a new word.
However, different services carry different information in their texts, so the degree of co-occurrence across texts of different services is low, which makes new word discovery inaccurate.
Disclosure of Invention
The main purpose of the present invention is to provide a word searching method and device and a computer-readable storage medium, aiming to solve the problem of inaccurate new word discovery.
In order to achieve the above object, the present invention provides a word searching method, which comprises the following steps:
obtaining a corpus, and classifying each sentence in the corpus to obtain a plurality of sets, wherein each set comprises a plurality of sentences and every sentence in a set belongs to the same service;
performing word segmentation on the sentences in each set to obtain the first words corresponding to each set;
determining, according to the set to which each first word belongs, the pointwise mutual information value and the left-right cross entropy corresponding to each first word;
and determining that the first word is a target word when the pointwise mutual information value corresponding to the first word is greater than a first threshold and the left-right cross entropy corresponding to the first word is greater than a second threshold, wherein the target word is a domain word that has not been stored.
In one embodiment, the step of determining that the first word is a target word when the pointwise mutual information value corresponding to the first word is greater than a first threshold and the left-right cross entropy corresponding to the first word is greater than a second threshold includes:
when the mutual information value between points corresponding to the first word is larger than a first threshold value and the left-right cross entropy corresponding to the first word is larger than a second threshold value, constructing a first vector corresponding to the first word;
determining the length of the first word, and constructing a second vector of the first word according to the length;
determining a third vector of the first word according to the first vector and the second vector of the first word, and inputting the third vector corresponding to each first word into a prediction model to obtain a label value corresponding to each first word output by the prediction model;
and when the label value corresponding to the first word is a preset value, determining that the first word is a target word.
In an embodiment, when the inter-point mutual information value corresponding to the first word is greater than a first threshold and the left-right cross entropy corresponding to the first word is greater than a second threshold, the step of constructing the first vector corresponding to the first word includes:
when the pointwise mutual information value corresponding to the first word is greater than a first threshold and the left-right cross entropy corresponding to the first word is greater than a second threshold, determining whether the first word is a preset combination, wherein the preset combination is a verb combination, a noun combination or a verb-noun combination;
and when the first word is a preset combination, constructing a first vector corresponding to the first word.
In an embodiment, the step of obtaining a corpus and classifying each sentence in the corpus to obtain a plurality of sets includes:
obtaining a corpus and determining whether the corpus is a title or not;
and when the corpus is not a title, classifying each sentence in the corpus to obtain a plurality of sets.
In an embodiment, after the step of determining whether the corpus is a title, the method further includes:
when the corpus is a title, determining the business field to which the title belongs;
and extracting field words from the corpus as target words according to the extraction rules corresponding to the business fields.
In an embodiment, the step of determining, according to the set to which each of the first terms belongs, an inter-point mutual information value and a left-right cross entropy corresponding to each of the first terms includes:
determining respective first frequencies corresponding to each of the first words, wherein the first frequencies are frequencies of appearance of phrases in the first words in the set corresponding to the first words;
determining a second frequency and a third frequency corresponding to each first word, wherein the second frequency is the frequency of occurrence of a second word before the first word in the set to which the first word belongs, and the third frequency is the frequency of occurrence of a third word after the first word in the set to which the first word belongs;
and determining a mutual point information value corresponding to the first word according to each first frequency corresponding to the first word, and determining the left-right cross entropy of the first word according to the second frequency and the third frequency corresponding to the first word.
In an embodiment, the step of classifying each sentence in the corpus to obtain a plurality of sets includes:
constructing a fourth vector corresponding to each statement in the corpus;
and classifying the sentences in the corpus according to the fourth vectors to obtain a plurality of sets.
In order to achieve the above object, the present invention further provides a word searching apparatus, which includes a processor, a memory, and a computer program stored in the memory and executable on the processor, and when the computer program is executed by the processor, the word searching method is implemented.
To achieve the above object, the present invention further provides a computer-readable storage medium storing computer-executable instructions, which when executed by a processor, implement the word searching method as described above.
To achieve the above object, the present invention further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the word searching method as described above.
According to the word searching method, device and computer-readable storage medium, a corpus is obtained, the sentences in the corpus are classified to obtain a plurality of sets, the sentences in each set are segmented to obtain the first words corresponding to each set, and the pointwise mutual information value and the left-right cross entropy corresponding to each first word are determined according to the set to which the first word belongs; a first word is determined to be an unstored new domain word if its pointwise mutual information value is greater than a first threshold and its left-right cross entropy is greater than a second threshold. Because the sentences in the corpus are classified so that sentences belonging to the same service fall into one set, new words are searched within sentences of the same service. This avoids the inaccuracy caused by the low co-occurrence degree across texts of different services and improves the accuracy of new word discovery.
Drawings
Fig. 1 is a schematic hardware structure diagram of a word search device according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of the word searching method of the present invention;
FIG. 3 is a detailed flowchart of step S40 in the second embodiment of the word searching method of the present invention;
FIG. 4 is a detailed flowchart of step S10 in the third embodiment of the word searching method of the present invention;
fig. 5 is a detailed flowchart of step S30 in the fourth embodiment of the word searching method of the present invention.
The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The main solution of the embodiment of the invention is as follows: obtaining a corpus, and classifying each sentence in the corpus to obtain a plurality of sets, wherein each set comprises a plurality of sentences and every sentence in a set belongs to the same service; performing word segmentation on the sentences in each set to obtain the first words corresponding to each set; determining, according to the set to which each first word belongs, the pointwise mutual information value and the left-right cross entropy corresponding to each first word; and determining that the first word is a target word when the pointwise mutual information value corresponding to the first word is greater than a first threshold and the left-right cross entropy corresponding to the first word is greater than a second threshold, wherein the target word is a domain word that has not been stored.
In the invention, the sentences in the corpus are classified so that sentences belonging to the same service fall into one set, and new words are searched within sentences of the same service. This avoids the inaccurate new word discovery caused by the low co-occurrence degree across texts of different services and improves the accuracy of new word discovery.
As shown in fig. 1, fig. 1 is a schematic diagram of a hardware structure of a search apparatus for words related to the embodiment of the present invention.
As shown in fig. 1, the embodiment of the present invention relates to a word searching device, which may be a terminal with data processing capability, such as a computer. The word searching device may include: a processor 101, such as a CPU, a communication bus 102, and a memory 103, wherein the communication bus 102 is used for enabling connection and communication between these components. Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of the word searching device, which may include more or fewer components than shown, a combination of certain components, or a different arrangement of components.
As shown in fig. 1, a computer program may be included in the memory 103, which is a kind of computer storage medium.
In the apparatus shown in fig. 1, the processor 101 may be configured to invoke a computer program stored in the memory 103 and perform the following operations:
obtaining a corpus, and classifying each sentence in the corpus to obtain a plurality of sets, wherein each set comprises a plurality of sentences and every sentence in a set belongs to the same service;
performing word segmentation on the sentences in each set to obtain the first words corresponding to each set;
determining, according to the set to which each first word belongs, the pointwise mutual information value and the left-right cross entropy corresponding to each first word;
and determining that the first word is a target word when the pointwise mutual information value corresponding to the first word is greater than a first threshold and the left-right cross entropy corresponding to the first word is greater than a second threshold, wherein the target word is a domain word that has not been stored.
In one embodiment, the processor 101 may invoke a computer program stored in the memory 103, further performing the following operations:
when the mutual information value between points corresponding to the first word is larger than a first threshold value and the left-right cross entropy corresponding to the first word is larger than a second threshold value, constructing a first vector corresponding to the first word;
determining the length of the first word, and constructing a second vector of the first word according to the length;
determining a third vector of the first word according to the first vector and the second vector of the first word, and inputting the third vector corresponding to each first word into a prediction model to obtain a label value corresponding to each first word output by the prediction model;
and when the label value corresponding to the first word is a preset value, determining that the first word is a target word.
In one embodiment, the processor 101 may invoke a computer program stored in the memory 103, further performing the following operations:
when the pointwise mutual information value corresponding to the first word is greater than a first threshold and the left-right cross entropy corresponding to the first word is greater than a second threshold, determining whether the first word is a preset combination, wherein the preset combination is a verb combination, a noun combination or a verb-noun combination;
and when the first word is a preset combination, constructing a first vector corresponding to the first word.
In one embodiment, the processor 101 may invoke a computer program stored in the memory 103, further performing the following operations:
obtaining a corpus and determining whether the corpus is a title or not;
and when the corpus is not a title, classifying each sentence in the corpus to obtain a plurality of sets.
In one embodiment, the processor 101 may invoke a computer program stored in the memory 103, further performing the following operations:
when the corpus is a title, determining the business field to which the title belongs;
and extracting field words from the corpus as target words according to the extraction rules corresponding to the business fields.
In one embodiment, the processor 101 may invoke a computer program stored in the memory 103, further performing the following operations:
determining respective first frequencies corresponding to each of the first words, wherein the first frequencies are frequencies of appearance of phrases in the first words in the set corresponding to the first words;
determining a second frequency and a third frequency corresponding to each first word, wherein the second frequency is the frequency of occurrence of a second word before the first word in the set to which the first word belongs, and the third frequency is the frequency of occurrence of a third word after the first word in the set to which the first word belongs;
and determining a mutual point information value corresponding to the first word according to each first frequency corresponding to the first word, and determining the left-right cross entropy of the first word according to the second frequency and the third frequency corresponding to the first word.
In one embodiment, the processor 101 may invoke a computer program stored in the memory 103, further performing the following operations:
constructing a fourth vector corresponding to each statement in the corpus;
and classifying the sentences in the corpus according to the fourth vectors to obtain a plurality of sets.
According to the scheme, a corpus is obtained, each sentence in the corpus is classified to obtain a plurality of sets, each sentence in each set is subjected to word segmentation to obtain each first word corresponding to each set, and then a mutual information value between points and a left-right cross entropy corresponding to each first word are determined according to the set to which each first word belongs; and determining that the first word is a new domain word which is not stored if the mutual information value between the points corresponding to the first word is greater than a first threshold value and the left-right cross entropy of the first word is greater than a second threshold value. In the invention, each sentence in the corpus is classified, so that the sentences belonging to the same service are positioned in one set, thereby searching for the new word through the sentences belonging to the same service, avoiding the inaccurate searching for the new word caused by the small co-occurrence degree of the texts of each service, and also improving the searching accuracy of the new word.
Based on the hardware architecture of the searching device of the words, the embodiment of the searching method of the words is provided.
Referring to fig. 2, fig. 2 is a first embodiment of the word searching method of the present invention, and the word searching method includes the following steps:
step S10, obtaining a corpus, and classifying each statement in the corpus to obtain a plurality of sets, wherein each set comprises a plurality of statements, and each statement in each set belongs to the same service.
In this embodiment, the execution subject is the word searching device; for ease of description, it is referred to below simply as the device. The device may be any terminal with data processing capability, for example, a computer. The device obtains a corpus. The corpus refers to the text used for finding words; there may be one or more texts. The corpus may be text pre-stored in the device, or text formed from service data transmitted by an external device.
The text is composed of sentences, and thus the corpus includes a plurality of sentences. The sentences in the corpus may be delimited by commas and periods. The device classifies the sentences in the corpus to obtain a plurality of sets. Specifically, the device can cluster the sentences with the K-means algorithm to obtain multiple clusters of sentences, each cluster being one set. The K-means algorithm groups similar sentences into the same cluster, so that the sentences in each cluster belong to the same service.
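A minimal sketch of this clustering step is shown below, assuming scikit-learn is available; the sentence features (character TF-IDF) and the number of clusters are illustrative assumptions, not values fixed by this embodiment.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_sentences(sentences, n_clusters=5):
    """Group sentences so that each returned set is expected to hold
    sentences of the same business/service."""
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
    features = vectorizer.fit_transform(sentences)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    clusters = {}
    for sentence, label in zip(sentences, labels):
        clusters.setdefault(label, []).append(sentence)
    return list(clusters.values())
```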
And step S20, performing word segmentation on the sentences in each set to obtain each first word corresponding to each set.
After obtaining a plurality of sets, the device performs word segmentation on the sentences of each set. N-Gram segmentation can be used. The processing granularity of the N-Gram segmentation can be treated as a hyperparameter; since the granularity differs between services, the processing granularity corresponding to each service can be set separately.
After the device segments each sentence in each set, a plurality of words can be obtained from each set. These words are defined as first words, i.e., each set comprises a plurality of first words.
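The following sketch shows one way to generate N-Gram candidates per sentence; the maximum length stands in for the per-service hyperparameter mentioned above, and the value used here is only an illustration.

```python
def ngram_candidates(sentence, max_n=4):
    """Return every contiguous character n-gram of length 2..max_n
    as a candidate first word."""
    candidates = []
    for n in range(2, max_n + 1):
        for i in range(len(sentence) - n + 1):
            candidates.append(sentence[i:i + n])
    return candidates
```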
And S30, determining a mutual information value between points and a left-right cross entropy corresponding to each first word according to the set to which each first word belongs.
After obtaining each first term of each set, the device may determine, based on the set of first terms, a mutual information value between points and a left-right cross entropy corresponding to each first term.
The pointwise mutual information value is mainly used to measure the semantic association between tokens. The basic idea is to count the probability that two tokens appear together in the text: the higher the co-occurrence probability, the stronger the association.
Left-right cross entropy is commonly used for new word discovery in statistical methods; the left entropy and right entropy of a candidate word are calculated, and the larger the entropy, the more likely the candidate is a new word. Since entropy represents uncertainty, a larger entropy means greater uncertainty, i.e., the richer and more varied the collocations on the left and right of the candidate.
And S40, determining that the first word is a target word when the mutual information value between points corresponding to the first word is larger than a first threshold and the left-right cross entropy corresponding to the first word is larger than a second threshold, wherein the target word is a domain word which is not stored.
The apparatus may calculate the PMI value and the left-right cross entropy corresponding to each first word. The device judges whether the PMI value corresponding to the first word is greater than a first threshold. If the PMI value of the first word is greater than the first threshold, it can be determined that the tokens in the first word frequently appear together in the text, that is, the cohesion and co-occurrence frequency of the tokens of the first word are high, so the probability that the first word is a new word is high. The apparatus also determines whether the left-right cross entropy corresponding to the first word is greater than a second threshold. The left-right cross entropy measures the boundary of a word: a larger left-right cross entropy indicates that the first word has rich collocations with other words, so the first word is more likely to be a new word. Accordingly, if the left-right cross entropy corresponding to the first word is also greater than the second threshold, the first word can be determined to be the target word. The target word is a domain word that has not been stored, i.e., a new word in the business field of the set to which the first word belongs. The apparatus adds the target words to a candidate set of domain words; each target word in the candidate set is used for training the LR (logistic regression) model, i.e., for iterating the LR model, which may be a model for determining new words.
If the PMI value corresponding to the first word is smaller than or equal to the first threshold, the first word is not the target word, and the first word can be discarded. If the PMI value of the first word is larger than the first threshold value, but the left-right cross entropy of the first word is smaller than or equal to the second threshold value, the first word is not the target word, and the first word can be discarded.
In the technical scheme provided by this embodiment, a corpus is obtained, each sentence in the corpus is classified to obtain a plurality of sets, the sentences of each set are participled to obtain each first word corresponding to each set, and then a mutual information value between points and a left-right cross entropy corresponding to each first word are determined according to the set to which each first word belongs; and determining that the first word is a new domain word which is not stored if the mutual information value between the points corresponding to the first word is greater than a first threshold value and the left-right cross entropy of the first word is greater than a second threshold value. In the invention, each sentence in the corpus is classified, so that the sentences belonging to the same service are positioned in one set, and the new words are searched through the sentences belonging to the same service, thereby avoiding the inaccurate searching of the new words caused by the small co-occurrence degree of the texts of each service, and also improving the searching accuracy of the new words.
Referring to fig. 3, fig. 3 is a second embodiment of the searching method according to the word of the present invention, and based on the first embodiment, step S40 includes:
step S41, when the mutual information value between the points corresponding to the first word is larger than a first threshold value and the left-right cross entropy corresponding to the first word is larger than a second threshold value, a first vector corresponding to the first word is constructed.
In this embodiment, when the pointwise mutual information value corresponding to the first word is greater than the first threshold and the left-right cross entropy corresponding to the first word is greater than the second threshold, the probability that the first word is a new word is high, but the first word is not necessarily a target word. In this case, the first word may be further evaluated with a prediction model to determine whether it is a target word. The prediction model is an LR model, which can be updated iteratively with the target words once they are determined.
The device constructs a vector corresponding to the first word using TF-IDF (term frequency-inverse document frequency, a common weighting technique in information retrieval and data mining); this vector is defined as the first vector. The TF-IDF weight can be expressed as:
TF-IDF(w_i) = tf(w_i) × log(N / n(w_i)),
where w_i is a word appearing in document i, tf(w_i) is the frequency of the word in document i, n(w_i) is the number of documents in which the word appears, and N is the total number of documents.
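A hedged sketch of this first-vector construction is given below, treating each set (cluster) as one "document"; the names and the smoothing term are illustrative assumptions.

```python
import math

def tf_idf(word, doc_index, documents):
    """documents: list of token lists; returns the TF-IDF weight of `word`
    with respect to documents[doc_index]."""
    doc = documents[doc_index]
    tf = doc.count(word) / max(len(doc), 1)        # term frequency in document i
    df = sum(1 for d in documents if word in d)    # documents containing the word
    idf = math.log(len(documents) / (df + 1))      # +1 avoids division by zero
    return tf * idf
```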
And S42, determining the length of the first word, and constructing a second vector of the first word according to the length.
The apparatus determines the length of the first word after determining the first vector, thereby constructing a vector of the first word based on the length of the first word, the vector being defined as a second vector. The second vector may be a One-Hot vector.
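The length-based second vector can be built, for example, as a simple one-hot encoding of the word length; the maximum length used here is an illustrative assumption.

```python
def length_one_hot(word, max_len=8):
    """One-hot encode the candidate word's length, capped at max_len."""
    vec = [0] * max_len
    vec[min(len(word), max_len) - 1] = 1
    return vec
```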
And S43, determining third vectors of the first words according to the first vectors and the second vectors of the first words, and inputting the third vectors corresponding to each first word into the prediction model to obtain a label value corresponding to each first word output by the prediction model.
After the device obtains the first vector and the second vector of the first word, a third vector of the first word can be obtained according to the first vector and the second vector of the first word, and the third vector can be obtained by splicing the first vector and the second vector. The apparatus inputs the third vector for each first term to the predictive model. The prediction model predicts the first words based on the third vectors and the prediction model outputs label values for the first words, and thus the apparatus may derive the label value for each of the first words based on the prediction model.
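A minimal sketch of the third-vector construction and prediction step follows, assuming scikit-learn's LogisticRegression as the LR prediction model; the embodiment is not tied to a particular library, so this is only an illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def predict_labels(first_vectors, second_vectors, model: LogisticRegression):
    """Concatenate the TF-IDF (first) and length one-hot (second) vectors
    into third vectors and return the model's label value per candidate."""
    third_vectors = np.hstack([np.asarray(first_vectors), np.asarray(second_vectors)])
    return model.predict(third_vectors)
```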
And S44, when the label value corresponding to the first word is a preset value, determining that the first word is the target word.
When the tag value of the first word is a preset value, it can be determined that the first word is the target word. The preset value may be 1.
Furthermore, domain words are basically noun combinations, verb combinations and verb-noun combinations. Therefore, when the pointwise mutual information value corresponding to the first word is greater than the first threshold and the left-right cross entropy corresponding to the first word is greater than the second threshold, it is also necessary to determine whether the first word is a preset combination, where the preset combination is a verb combination, a noun combination or a verb-noun combination. If the first word is a preset combination, the first vector corresponding to the first word is constructed, that is, whether the first word is the target word is then determined by the prediction model. If the first word is not a preset combination, the first word is not the target word, and the prediction model does not need to be applied to it. The preset combination can be determined with an open-source segmentation and part-of-speech tool such as HanLP, as sketched below.
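The following sketch illustrates the preset-combination check using jieba's part-of-speech tagger as a stand-in for the HanLP tool mentioned above; the accepted patterns follow the description (verb/verb, noun/noun, verb/noun), but the coarse tag mapping is an assumption.

```python
import jieba.posseg as pseg

def is_preset_combination(word):
    """Return True if the candidate splits into a verb, noun or
    verb-noun combination according to the POS tagger."""
    tags = [p.flag[0] for p in pseg.cut(word)]   # coarse POS class per token
    if len(tags) < 2:
        return False
    allowed = {("v", "v"), ("n", "n"), ("v", "n")}
    return (tags[0], tags[-1]) in allowed
```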
In the technical scheme provided by this embodiment, the device predicts each first word based on the prediction model, thereby accurately determining whether the first word is a new field word.
Referring to fig. 4, fig. 4 is a third embodiment of the searching method of the word in the present invention, and based on the first or second embodiment, step S10 includes:
and step S11, acquiring the corpus and determining whether the corpus is a title or not.
The data under each service is document data, and contains structural information such as document titles, paragraph titles, paragraph texts and document closing terms. For example, a document title may be "agent precious metal service", with paragraph titles such as "introduction, service object, features and advantages, transaction channel, transaction flow, and transaction data", each title having a corresponding body text.
Titles contain few words, and their format and wording are relatively fixed, so new words in titles can be obtained in a simpler and more convenient way.
In this regard, the apparatus obtains a corpus and determines whether the corpus is a title.
And S12, classifying the sentences in the corpus to obtain a plurality of sets when the corpus is not the title.
A title is a short piece of text, so it is only necessary to determine whether the number of characters in a text segment is less than a preset number, or whether its number of lines is less than a preset number of lines. If so, the corpus can be determined to be a title; if not, the corpus is not a title.
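A minimal sketch of this title check is shown below; the character and line thresholds are illustrative assumptions.

```python
def is_title(corpus_text, max_chars=20, max_lines=1):
    """Heuristically decide whether a piece of corpus text is a title."""
    lines = [line for line in corpus_text.splitlines() if line.strip()]
    return len(lines) <= max_lines and len(corpus_text.strip()) <= max_chars
```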
When the corpus is not the title, classifying the sentences in the corpus to obtain a plurality of sets, that is, classifying the sentences in the corpus to find the target words.
In this embodiment, a title extraction rule is set, and the extraction rules of different business fields are different. And setting the extraction rule of the business field based on the setting rule of the document title and the paragraph title of the business field. When the corpus is a title, the device determines a business field to which the title belongs, and accordingly extracts a field word from the corpus as a target word based on an extraction rule corresponding to the business field.
In the technical scheme provided by the embodiment, the device determines whether the corpus is a title or not after acquiring the corpus, if the corpus is not the title, new words are searched based on a classification mode, and if the corpus is the title, the target words are directly extracted based on the extraction rule, so that the omission of the new words in the title is avoided.
Referring to fig. 5, fig. 5 is a fourth embodiment of the searching method according to the above words, and based on any one of the first to third embodiments, step S30 includes:
step S31, determining each first frequency corresponding to each first word, where the first frequency is a frequency of appearance of a word group in the first word in a set corresponding to the first word.
Step S32, determining a second frequency and a third frequency corresponding to each first word, where the second frequency is a frequency of a second word before the first word appearing in the set to which the first word belongs, and the third frequency is a frequency of a third word after the first word appearing in the set to which the first word belongs.
Step S33, determining the mutual information value between points corresponding to the first word according to each first frequency corresponding to the first word, and determining the left-right cross entropy of the first word according to the second frequency and the third frequency corresponding to the first word.
In this embodiment, the device determines the PMI value and left-right cross entropy for the first word. Specifically, the device determines respective first frequencies corresponding to each first term, where the first frequencies are frequencies of occurrences of words in the first terms in the set corresponding to the first terms. Each word of the first term corresponds to a first frequency.
The apparatus then determines a second frequency and a third frequency for each first term. The second frequency is a frequency with which a second word preceding the first word appears in the set to which the first word belongs, and the third frequency is a frequency with which a third word following the first word appears in the set to which the first word belongs. It will be understood that the second term is the term to the left of the first term and the third term is the term to the right of the first term.
The apparatus determines the PMI value corresponding to the first word according to each first frequency of the first word: PMI(x, y) = log2( P(xy) / (P(x) · P(y)) ), where x and y are the word groups making up the first word (a word group may be a single word), P(x) is the first frequency of word group x, P(y) is the first frequency of y, and P(xy) is the first frequency of the first word (the first word composed of x and y is itself a word group).
The apparatus may calculate the left-right cross entropy of the first word when its PMI value is greater than the first threshold. The left and right entropies are computed as
H_left(x) = − Σ_{w ∈ X_left} P(w | x) · log2 P(w | x) and H_right(x) = − Σ_{w ∈ X_right} P(w | x) · log2 P(w | x),
where X_left and X_right are the sets of all left-adjacent and right-adjacent words of the word group x, respectively. The word frequencies of x, y and the left and right neighbors are counted separately in advance.
In the technical solution provided in this embodiment, the apparatus determines each of the second frequency, the third frequency, and the first frequency of the first word, accurately determines the PMI value of the first word based on the first frequency, and accurately determines the left-right cross entropy of the first word according to the second frequency and the third frequency.
In an embodiment, the step of classifying each sentence in the corpus to obtain a plurality of sets includes:
constructing a fourth vector corresponding to each statement in the corpus;
and classifying the sentences in the corpus according to the fourth vectors to obtain a plurality of sets.
In this embodiment, when classifying a sentence, the apparatus classifies the sentence based on the vector of the sentence. The vector of the statement is defined as the fourth vector. The device constructs a fourth vector corresponding to each statement in the corpus.
For example, for the sentences "transaction data", "transaction flow" and "transaction channel", the character vocabulary contains eight characters: the two characters of "transaction" plus the two characters of each of "data", "flow" and "channel". The fourth vectors constructed for "transaction data", "transaction flow" and "transaction channel" are respectively:
[[0, 0, 0.39, 0.39, 0, 0, 0, 0], [0, 0, 0, 0, 0.39, 0.39, 0, 0], [0, 0, 0, 0, 0, 0, 0.39, 0.39]],
where the two shared "transaction" characters receive weight 0 because they appear in every sentence.
The device can classify the sentences in the corpus according to the fourth vectors to obtain the sets. One classification manner is to calculate the Euclidean distance between any two fourth vectors and place the sentences whose fourth vectors are within a preset threshold of each other into the same class; in other words, the Euclidean distance between every two sentences in one set is smaller than the preset threshold.
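A minimal sketch of this distance-based grouping is shown below; the distance threshold is an illustrative value.

```python
import numpy as np

def group_by_distance(vectors, threshold=0.5):
    """Greedily group sentence indices whose fourth vectors are all within
    `threshold` Euclidean distance of each other."""
    groups = []
    for i, v in enumerate(vectors):
        for group in groups:
            if all(np.linalg.norm(v - vectors[j]) < threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])
    return groups
```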
In the technical scheme provided by this embodiment, the device constructs the fourth vector corresponding to each sentence in the corpus, so that each sentence is accurately classified based on the fourth vector, and the accuracy of searching for new words is improved.
The invention also provides a word searching device, which comprises a processor, a memory and a computer program stored in the memory and capable of running on the processor, wherein when the computer program is executed by the processor, the computer program realizes the steps of the word searching method.
The present invention also provides a computer-readable storage medium, which stores computer-executable instructions, and the computer-executable instructions are executed by a processor to implement the word searching method of the above embodiment.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (9)
1. A method for searching words is characterized by comprising the following steps:
obtaining a corpus, and classifying each sentence in the corpus to obtain a plurality of sets, wherein each set comprises a plurality of sentences and every sentence in a set belongs to the same service;
performing word segmentation on the sentences in each set to obtain the first words corresponding to each set;
determining, according to the set to which each first word belongs, the pointwise mutual information value and the left-right cross entropy corresponding to each first word;
and determining that the first word is a target word when the pointwise mutual information value corresponding to the first word is greater than a first threshold and the left-right cross entropy corresponding to the first word is greater than a second threshold, wherein the target word is a domain word that has not been stored.
2. The word searching method according to claim 1, wherein the step of determining that the first word is a target word when the pointwise mutual information value corresponding to the first word is greater than a first threshold and the left-right cross entropy corresponding to the first word is greater than a second threshold comprises:
when the mutual information value between points corresponding to the first word is larger than a first threshold value and the left-right cross entropy corresponding to the first word is larger than a second threshold value, constructing a first vector corresponding to the first word;
determining the length of the first word, and constructing a second vector of the first word according to the length;
determining a third vector of the first word according to the first vector and the second vector of the first word, and inputting the third vector corresponding to each first word into a prediction model to obtain a label value corresponding to each first word output by the prediction model;
and when the label value corresponding to the first word is a preset value, determining that the first word is a target word.
3. The method according to claim 2, wherein when the mutual information value between points corresponding to the first term is greater than a first threshold and the left-right cross entropy corresponding to the first term is greater than a second threshold, the step of constructing the first vector corresponding to the first term includes:
when the pointwise mutual information value corresponding to the first word is greater than a first threshold and the left-right cross entropy corresponding to the first word is greater than a second threshold, determining whether the first word is a preset combination, wherein the preset combination is a verb combination, a noun combination or a verb-noun combination;
and when the first word is a preset combination, constructing a first vector corresponding to the first word.
4. The word searching method according to claim 1, wherein the step of obtaining the corpus and classifying each sentence in the corpus to obtain a plurality of sets comprises:
obtaining a corpus and determining whether the corpus is a title or not;
and when the corpus is not a title, classifying each sentence in the corpus to obtain a plurality of sets.
5. The method of claim 4, wherein after the step of determining whether the corpus is a title, the method further comprises:
when the corpus is a title, determining the business field to which the title belongs;
and extracting field words from the corpus as target words according to the extraction rules corresponding to the business fields.
6. The method for searching for words according to any one of claims 1-5, wherein the step of determining the mutual information value between points and the left-right cross entropy corresponding to each of the first words according to the set to which each of the first words belongs comprises:
determining respective first frequencies corresponding to each of the first words, wherein the first frequencies are frequencies at which phrases in the first words appear in the set corresponding to the first words;
determining a second frequency and a third frequency corresponding to each first word, wherein the second frequency is the frequency of the occurrence of a second word before the first word in the set to which the first word belongs, and the third frequency is the frequency of the occurrence of a third word after the first word in the set to which the first word belongs;
and determining a mutual point information value corresponding to the first word according to each first frequency corresponding to the first word, and determining the left-right cross entropy of the first word according to the second frequency and the third frequency corresponding to the first word.
7. The method according to any one of claims 1-5, wherein the step of classifying each sentence in the corpus into a plurality of sets comprises:
constructing a fourth vector corresponding to each statement in the corpus;
and classifying the sentences in the corpus according to the fourth vectors to obtain a plurality of sets.
8. An apparatus for searching for a word, the apparatus comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein when executed by the processor, the computer program implements a method for searching for a word according to any one of claims 1-7.
9. A computer-readable storage medium having stored thereon computer-executable instructions for implementing a method of finding a term as claimed in any one of claims 1 to 7 when executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210027820.4A CN115858771A (en) | 2022-01-11 | 2022-01-11 | Word searching method and device and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210027820.4A CN115858771A (en) | 2022-01-11 | 2022-01-11 | Word searching method and device and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115858771A true CN115858771A (en) | 2023-03-28 |
Family
ID=85659950
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210027820.4A Pending CN115858771A (en) | 2022-01-11 | 2022-01-11 | Word searching method and device and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115858771A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649666A (en) * | 2016-11-30 | 2017-05-10 | 浪潮电子信息产业股份有限公司 | Left-right recursion-based new word discovery method |
CN108984514A (en) * | 2017-06-05 | 2018-12-11 | 中兴通讯股份有限公司 | Acquisition methods and device, storage medium, the processor of word |
CN110909540A (en) * | 2018-09-14 | 2020-03-24 | 阿里巴巴集团控股有限公司 | Method and device for identifying new words of short message spam and electronic equipment |
CN110008464A (en) * | 2019-01-02 | 2019-07-12 | 阿里巴巴集团控股有限公司 | Construction method, device, server and the readable storage medium storing program for executing of business dictionary |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11093854B2 (en) | Emoji recommendation method and device thereof | |
CN104615608B (en) | A kind of data mining processing system and method | |
CN107301170B (en) | Method and device for segmenting sentences based on artificial intelligence | |
CN108255813B (en) | Text matching method based on word frequency-inverse document and CRF | |
CN109299228B (en) | Computer-implemented text risk prediction method and device | |
CN112395385B (en) | Text generation method and device based on artificial intelligence, computer equipment and medium | |
CN109271514B (en) | Generation method, classification method, device and storage medium of short text classification model | |
KR20160121382A (en) | Text mining system and tool | |
CN107885717B (en) | Keyword extraction method and device | |
CN111158641B (en) | Automatic recognition method for transaction function points based on semantic analysis and text mining | |
CN112860896A (en) | Corpus generalization method and man-machine conversation emotion analysis method for industrial field | |
CN115168567B (en) | Knowledge graph-based object recommendation method | |
CN110874532A (en) | Method and device for extracting keywords of feedback information | |
CN114385791A (en) | Text expansion method, device, equipment and storage medium based on artificial intelligence | |
WO2019163642A1 (en) | Summary evaluation device, method, program, and storage medium | |
CN113158667B (en) | Event detection method based on entity relationship level attention mechanism | |
CN113934848A (en) | Data classification method and device and electronic equipment | |
CN111930949B (en) | Search string processing method and device, computer readable medium and electronic equipment | |
CN111523311B (en) | Search intention recognition method and device | |
CN112560425A (en) | Template generation method and device, electronic equipment and storage medium | |
WO2023245869A1 (en) | Speech recognition model training method and apparatus, electronic device, and storage medium | |
CN107729509B (en) | Discourse similarity determination method based on recessive high-dimensional distributed feature representation | |
CN115577109A (en) | Text classification method and device, electronic equipment and storage medium | |
CN115858771A (en) | Word searching method and device and computer readable storage medium | |
Lalrempuii et al. | Sentiment classification of crisis related tweets using segmentation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20230328 |
RJ01 | Rejection of invention patent application after publication |