CN105260362A - New word extraction method and device


Publication number
CN105260362A
Authority
CN
China
Prior art keywords
candidate
degree
lemma
lemmas
words
Legal status
Granted
Application number
CN201510729084.7A
Other languages
Chinese (zh)
Other versions
CN105260362B (en)
Inventor
赵旭海
孟超
王海洲
张寅
Current Assignee
Xiaomi Inc
Original Assignee
Xiaomi Inc
Application filed by Xiaomi Inc
Priority to CN201510729084.7A
Publication of CN105260362A
Application granted
Publication of CN105260362B
Status: Active


Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a new word extraction method and device. The method comprises the following steps: calculating the cohesion degree of multiple candidate lexical units, where the cohesion degree represents the probability that a candidate lexical unit is a fixed word or set phrase; calculating the degree of freedom of the multiple candidate lexical units, where the degree of freedom represents how flexibly a candidate lexical unit can be matched with fixed words or set phrases, a higher degree of freedom value indicating that the candidate lexical unit can be matched with more fixed words or set phrases; performing a weighted calculation on the calculated cohesion degree and degree of freedom of each candidate lexical unit to obtain a weighted sum; and extracting candidate words or candidate phrases from the candidate lexical units based on the calculated weighted sums. According to the invention, new candidate words or candidate phrases can be extracted from the candidate lexical units more intelligently, and the extraction accuracy of candidate words or candidate phrases can be remarkably improved.

Description

New word extraction method and device
Technical Field
The disclosure relates to the field of terminals, in particular to a new word extraction method and device.
Background
As the variety of goods sold on e-commerce websites grows and users' everyday word and sentence searches evolve, many unique brand names and popular word collocations have appeared. These word collocations are usually not stored in the original word segmentation word list of a search engine, so the search engine may fail to split certain word collocations searched by a user accurately, and the search results may not meet the user's expectations.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a new word extraction method and apparatus.
According to a first aspect of the embodiments of the present disclosure, there is provided a new word extraction method, including:
calculating the degree of aggregation of a plurality of candidate lemmas; the degree of aggregation characterizes the probability of the candidate lemma as a fixed word or a fixed phrase;
calculating degrees of freedom of the candidate lemmas; the degree of freedom characterizes the flexibility with which the candidate lemmas are collocated with fixed words or fixed phrases; a higher degree of freedom value indicates that the candidate lemma can be collocated with more fixed words or fixed phrases;
respectively carrying out weighted calculation on the calculated aggregation degree and the calculated freedom degree of each candidate lemma in the plurality of candidate lemmas to obtain weighted sums;
and extracting candidate words or candidate phrases from the candidate lemmas based on the calculated weighted sum.
Optionally, the method further includes:
obtaining a corpus;
and carrying out lexical element splitting on the obtained corpus to obtain the candidate lexical elements.
Optionally, the corpus includes a commodity name and a user search log.
Optionally, the calculating the aggregation degrees of the multiple candidate lemmas includes:
counting the occurrence times of the candidate lemmas in all the linguistic data;
sequentially selecting the candidate word elements as target candidate word elements;
and calculating the degree of cohesion of the selected target candidate lemmas according to a degree of cohesion calculation formula.
Wherein, the calculation formula of the degree of aggregation is as follows:
S(A1A2…An) = min{ P(A1A2…An) / (P(A1) * P(A2…An)), P(A1A2…An) / (P(A1A2) * P(A3…An)), …, P(A1A2…An) / (P(A1…An-1) * P(An)) }
or,
S(A1A2…An) = max{ P(A1A2…An) / (P(A1) * P(A2…An)), P(A1A2…An) / (P(A1A2) * P(A3…An)), …, P(A1A2…An) / (P(A1…An-1) * P(An)) }
where S(A1A2…An) represents the degree of aggregation of the candidate lemma [A1A2…An]; P(A1A2…An) represents the probability of occurrence of the candidate lemma [A1A2…An] in all corpora; P(A1) and P(A2…An) respectively represent the probabilities of occurrence, in all corpora, of the character [A1] and the character string [A2…An] split from the candidate lemma [A1A2…An]; the occurrence probability is the ratio of the number of occurrences in all corpora to the total length of the corpora; and the value of n represents the number of characters or letters constituting the candidate lemma.
Optionally, the calculating the degrees of freedom of the plurality of candidate lemmas includes:
recording adjacent characters of the candidate lemmas, and counting the occurrence times of the adjacent characters of the candidate lemmas in all linguistic data;
sequentially selecting the candidate word elements as target candidate word elements;
calculating the occurrence probability of the adjacent characters of the selected target candidate lemmas in all linguistic data;
calculating information entropy of the adjacent word based on the calculated occurrence probability;
and adding the calculated information entropies of all the adjacent words of the target candidate lemma to obtain the degree of freedom of the target candidate lemma.
Optionally, the weighting calculation of the aggregation degree and the degree of freedom of each of the plurality of candidate lemmas to obtain a weighted sum includes:
respectively carrying out weighted calculation on the aggregation degree and the freedom degree of each candidate lemma in the plurality of candidate lemmas according to a preset weighting formula to obtain a weighted sum;
the preset weighting formula is as follows:
F(ω) = S(ω)*ω1 + I(ω)*ω2
or,
F(ω) = S(ω)*ω1 + I(ω)*ω2 + C(ω)*ω3
where F(ω) represents the weighted sum; S(ω) represents the degree of aggregation of the candidate lemma; ω1 represents the weight ratio preset for the degree of aggregation; I(ω) represents the degree of freedom of the candidate lemma; ω2 represents the weight ratio preset for the degree of freedom; C(ω) represents a correction parameter; and ω3 represents the weight ratio preset for the correction parameter.
Optionally, the correction parameter is the occurrence probability of the candidate lemma in all the corpora;
the method further comprises the following steps:
judging whether the total length of the corpora is greater than a preset threshold value;
when the total length of the corpora is lower than the preset threshold value, increasing, by a preset amplitude, the weight proportion of the occurrence probability of the candidate lemmas in all the corpora;
and when the total length of the corpora is greater than the preset threshold value, reducing, by a preset amplitude, the weight proportion of the occurrence probability of the candidate lemmas in all the corpora.
Optionally, the extracting candidate words or candidate phrases from the candidate lemmas based on the weighted sum obtained by the calculation includes:
sorting the weighted sums obtained by calculation according to the numerical value;
extracting m candidate lemmas with the highest weighted sum in the candidate lemmas as candidate words or candidate phrases based on the ordering;
wherein the value of m is set by a user.
Optionally, the method further includes:
outputting the extracted candidate words or candidate phrases on a manual review interface;
and importing the candidate words or candidate phrases that have passed review in the manual review interface into the spelling suggestions or word segmentation word list of a search engine.
According to a second aspect of the embodiments of the present disclosure, there is provided a new word extraction apparatus, the apparatus including:
a first calculation module configured to calculate a degree of aggregation of a plurality of candidate lemmas; the degree of aggregation characterizes the probability of the candidate lemma as a fixed word or a fixed phrase;
a second calculation module configured to calculate degrees of freedom of the plurality of candidate lemmas; the degree of freedom characterizes the flexibility with which the candidate lemmas are collocated with fixed words or fixed phrases; a higher degree of freedom value indicates that the candidate lemma can be collocated with more fixed words or fixed phrases;
a third calculation module configured to perform weighted calculation on the degree of aggregation and the degree of freedom of each candidate lemma calculated by the first calculation module and the second calculation module to obtain a weighted sum;
an extracting module configured to extract candidate words or candidate phrases from the plurality of candidate lemmas based on the weighted sum calculated by the third calculating module.
Optionally, the apparatus further comprises:
an acquisition module configured to acquire a corpus;
and the splitting module is configured to split the morphemes of the corpus acquired by the acquiring module to obtain the candidate morphemes.
Optionally, the corpus includes a commodity name and a user search log.
Optionally, the first computing module includes:
the first statistic submodule is configured to count the occurrence times of the candidate lemmas in all the linguistic data;
a first selection submodule configured to sequentially select the plurality of candidate lemmas as target candidate lemmas;
a first calculation submodule configured to calculate the degree of aggregation of the target candidate lemma selected by the first selection submodule according to an aggregation degree calculation formula.
Wherein, the calculation formula of the degree of aggregation is as follows:
S(A1A2…An) = min{ P(A1A2…An) / (P(A1) * P(A2…An)), P(A1A2…An) / (P(A1A2) * P(A3…An)), …, P(A1A2…An) / (P(A1…An-1) * P(An)) }
or,
S(A1A2…An) = max{ P(A1A2…An) / (P(A1) * P(A2…An)), P(A1A2…An) / (P(A1A2) * P(A3…An)), …, P(A1A2…An) / (P(A1…An-1) * P(An)) }
where S(A1A2…An) represents the degree of aggregation of the candidate lemma [A1A2…An]; P(A1A2…An) represents the probability of occurrence of the candidate lemma [A1A2…An] in all corpora; P(A1) and P(A2…An) respectively represent the probabilities of occurrence, in all corpora, of the character [A1] and the character string [A2…An] split from the candidate lemma [A1A2…An]; the occurrence probability is the ratio of the number of occurrences in all corpora to the total length of the corpora; and the value of n represents the number of characters or letters constituting the candidate lemma.
Optionally, the second computing module includes:
the second statistic submodule is configured to record adjacent words of the candidate lemmas and count the occurrence times of the adjacent words of the candidate lemmas in all the linguistic data;
a second selection submodule configured to sequentially select the plurality of candidate lemmas as target candidate lemmas;
a second calculating submodule configured to calculate occurrence probabilities of adjacent words of the target candidate lemma selected by the second selecting submodule in all the corpora;
a third calculation submodule configured to calculate an information entropy of the adjacent word based on the occurrence probability calculated by the second calculation submodule;
an adding sub-module configured to add the information entropies of all the adjacent words of the target candidate lemma calculated by the third calculation submodule, to obtain the degree of freedom of the target candidate lemma.
Optionally, the third computing module includes:
the weighting submodule is configured to perform weighting calculation on the aggregation degree and the freedom degree of each candidate lemma in the plurality of candidate lemmas calculated by the first calculation module and the second calculation module according to a preset weighting formula to obtain a weighted sum;
the preset weighting formula is as follows:
F(ω) = S(ω)*ω1 + I(ω)*ω2
or,
F(ω) = S(ω)*ω1 + I(ω)*ω2 + C(ω)*ω3
where F(ω) represents the weighted sum; S(ω) represents the degree of aggregation of the candidate lemma; ω1 represents the weight ratio preset for the degree of aggregation; I(ω) represents the degree of freedom of the candidate lemma; ω2 represents the weight ratio preset for the degree of freedom; C(ω) represents a correction parameter; and ω3 represents the weight ratio preset for the correction parameter.
Optionally, the correction parameter is the occurrence probability of the candidate lemma in all the corpora;
the third computing module further comprises:
the judging submodule is configured to judge whether the total length of the corpus is greater than a preset threshold value;
an increasing submodule configured to increase, by a preset amplitude, the weight proportion of the occurrence probability of the candidate lemma in all the corpora when the total length of the corpora is lower than the preset threshold;
and the reducing submodule is configured to reduce the weight proportion of the occurrence probability of the candidate lemma in all the corpuses based on a preset amplitude when the total length of the corpuses is greater than a preset threshold value.
Optionally, the extracting module includes:
the sorting submodule is configured to sort the weighted sum calculated by the third calculation module according to the numerical value;
an extraction sub-module configured to extract m candidate lemmas with the highest weighted sum among the plurality of candidate lemmas as candidate words or candidate phrases based on the ranking; wherein the value of m is set by a user.
Optionally, the extracting module further includes:
the output sub-module is configured to output the candidate words or candidate phrases extracted by the extraction sub-module on a manual review interface;
and the import sub-module is configured to import the candidate words or candidate word groups which are approved in the manual review interface into spelling suggestions or word segmentation lists of a search engine.
According to a third aspect of the embodiments of the present disclosure, there is also provided a new word extraction device, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
calculating the degree of aggregation of a plurality of candidate lemmas; the degree of aggregation characterizes the probability of the candidate lemma as a fixed word or a fixed phrase;
calculating degrees of freedom of the candidate lemmas; the degree of freedom characterizes the flexibility with which the candidate lemmas are collocated with fixed words or fixed phrases; a higher degree of freedom value indicates that the candidate lemma can be collocated with more fixed words or fixed phrases;
respectively carrying out weighted calculation on the calculated aggregation degree and the calculated freedom degree of each candidate lemma in the plurality of candidate lemmas to obtain weighted sums;
and extracting candidate words or candidate phrases from the candidate lemmas based on the calculated weighted sum.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
In the above embodiments of the present disclosure, the degree of aggregation of a plurality of candidate lemmas is calculated, where the degree of aggregation characterizes the probability of a candidate lemma being a fixed word or fixed phrase; the degrees of freedom of the candidate lemmas are calculated, where the degree of freedom characterizes the flexibility with which a candidate lemma is collocated with fixed words or fixed phrases, a higher value indicating that more fixed words or fixed phrases can be collocated with the candidate lemma; the calculated degree of aggregation and degree of freedom of each candidate lemma are then weighted to obtain a weighted sum; and candidate words or candidate phrases are extracted from the plurality of candidate lemmas based on the calculated weighted sums. In this way, new candidate words or candidate phrases can be extracted from the candidate lemmas more intelligently, and the extraction accuracy can be remarkably improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating a method of new word extraction, according to an example embodiment;
FIG. 2 is a flow diagram illustrating another method of extracting new words in accordance with an illustrative embodiment;
FIG. 3 is a schematic block diagram of a new word extraction apparatus shown in accordance with an exemplary embodiment;
FIG. 4 is a schematic block diagram of another new word extraction apparatus shown in accordance with an exemplary embodiment;
FIG. 5 is a schematic block diagram of another new word extraction apparatus shown in accordance with an exemplary embodiment;
FIG. 6 is a schematic block diagram of another new word extraction apparatus shown in accordance with an exemplary embodiment;
FIG. 7 is a schematic block diagram of another new word extraction apparatus shown in accordance with an exemplary embodiment;
FIG. 8 is a schematic block diagram of another new word extraction apparatus shown in accordance with an exemplary embodiment;
FIG. 9 is a schematic block diagram of another new word extraction apparatus shown in accordance with an exemplary embodiment;
FIG. 10 is a schematic block diagram of another new word extraction apparatus shown in accordance with an exemplary embodiment;
fig. 11 is a schematic structural diagram illustrating an apparatus for extracting new words according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "upon", "when", or "in response to a determination", depending on the context.
In the existing implementation, for word collocations which are not stored in the original word segmentation word list of the search engine, when the search engine splits the word collocations, the word collocations may not be split accurately, so that the search result does not meet the expectation of the user.
In order to solve the above problem, a common practice at present is to discover and extract the latest word collocations from a corpus in time according to a new word discovery strategy, and then import the extracted newly appearing word collocations into the word segmentation word list of a search engine.
In the related art, current new word discovery strategies are usually based on word frequency statistics: if several characters or words appear together as a collocation many times, they are likely a fixed word or fixed phrase. Therefore, the characters or words are ranked by their number of occurrences, the top N most frequent ones are extracted, words or phrases already present in the search engine's word segmentation word list are removed, and the remaining words or phrases are taken as new words or phrases.
However, the above new word discovery strategy considers neither the internal inseparability of a word, i.e., whether it is a common fixed collocation, nor whether the word can flexibly appear in different contexts. As a result, the accuracy of new word extraction is poor, and omissions or errors easily occur.
In view of the above, the present disclosure provides a new word extraction method, which calculates the degree of aggregation of a plurality of candidate lemmas, where the degree of aggregation characterizes the probability of a candidate lemma being a fixed word or fixed phrase; calculates the degrees of freedom of the candidate lemmas, where the degree of freedom characterizes the flexibility with which a candidate lemma is collocated with fixed words or fixed phrases, a higher value indicating that more fixed words or fixed phrases can be collocated with the candidate lemma; then weights the calculated degree of aggregation and degree of freedom of each candidate lemma to obtain a weighted sum; and extracts candidate words or candidate phrases from the plurality of candidate lemmas based on the calculated weighted sums.
As shown in fig. 1, fig. 1 is a new word extraction method according to an exemplary embodiment, where the new word extraction method is used in a server, and includes the following steps:
in step 101, calculating the aggregation of a plurality of candidate lemmas; the degree of aggregation characterizes a probability of the candidate token as a fixed word or a fixed phrase.
In step 102, calculating degrees of freedom of the candidate lemmas; the freedom degree characterizes the flexibility of matching the candidate word elements with fixed words or fixed word groups; the higher the degree of freedom value is, the more fixed words or fixed phrases the candidate lemma can be collocated with.
In step 103, the calculated aggregation and the calculated degree of freedom of each of the plurality of candidate lemmas are weighted to obtain a weighted sum.
In step 104, candidate words or candidate phrases are extracted from the candidate lemmas based on the calculated weighted sum.
The server side can comprise a server, a server cluster or a cloud platform constructed based on the server cluster, wherein the server provides search services for users. For example, in an application scenario of an e-commerce website, the server may be a server providing a commodity search service to users. The server can split the keywords input by the user through a word segmentation word list in the search engine, and searches and outputs matched commodities for the user in the commodities to be sold according to the split lemmas.
The technical solution of the present disclosure is described in detail below with an application scenario of an e-commerce website, in which the server is a server that provides a commodity search service for a user as an example.
In an application scenario of an e-commerce network, the candidate lemma may be a candidate word or a candidate phrase obtained by a server splitting the lemma for the obtained corpus.
The corpora may include commodity names and user search logs. When acquiring corpora, the server can periodically extract the names of newly listed commodities to be sold and the users' search logs.
For the commodity names, the server may periodically extract the commodity names under different commodity classifications and then write them into different texts by classification in a certain order; for example, each line in a text corresponds to one commodity name. For the user search logs, the server can record the search term entered in each of the user's search operations, merge duplicate search terms, and then write them into a text in a certain order; for example, there may still be one search term per line in the text.
In the present disclosure, the obtained corpus may be preprocessed by the server. When preprocessing the corpus, the server can split it into a plurality of candidate lemmas, and different splitting modes may be used for this lemma splitting.
In an embodiment shown in the present disclosure, the server may set a maximum splitting length and a minimum splitting length, both of which may be customized by the user; the maximum splitting length is the maximum length of a split lemma, and the minimum splitting length is the minimum length of a split lemma. The minimum splitting length can be set to 2, which guarantees that no single-character lemmas appear in the final splitting result.
When the minimum splitting length and the maximum splitting length are set, a length range of the lexeme splitting can be formed, and when the server splits the lexemes, the server can split the lexemes according to each splitting length value in the length range of the lexeme splitting so as to split the lexemes into a plurality of candidate lexemes.
For example, assume that the corpus acquired by the server includes the commodity name "millet mobile phone version", the maximum splitting length set by the user is 4, and the minimum splitting length is 2. When the server splits this commodity name into lemmas, it may first split it by the minimum splitting length 2 into the candidate lemmas "millet", "mobile phone" and "version", where the candidate lemma "version" may be discarded because its length is less than 2. After the name has been split by the minimum splitting length 2, it can be further split by splitting length 3 into "millet hand", "mobile phone" and "version". After the name has been split by splitting length 3, it can be further split by the maximum splitting length 4 into the two lemmas "millet mobile phone" and "mobile version". At this point the splitting of the commodity name is complete, and the finally split candidate lemmas include the 6 lemmas "millet", "mobile phone", "millet hand", "mobile phone", "millet mobile phone" and "mobile version".
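The splitting procedure above can be expressed as a short sketch. This is a minimal illustration under stated assumptions, not the patent's implementation: it treats the corpus line as a plain string, cuts it into consecutive non-overlapping chunks once per splitting length (as the worked example suggests), and discards trailing chunks shorter than the minimum length; all names are illustrative.

```python
def split_candidate_lemmas(text, min_len=2, max_len=4):
    """Cut `text` into consecutive chunks once per splitting length in
    [min_len, max_len]; remainder chunks shorter than min_len are
    dropped, like the discarded "version" fragment in the example."""
    candidates = []
    for n in range(min_len, max_len + 1):   # one pass per splitting length
        for i in range(0, len(text), n):
            chunk = text[i:i + n]
            if len(chunk) >= min_len:
                candidates.append(chunk)
    return list(dict.fromkeys(candidates))  # de-duplicate, keep order
```

For a seven-character Chinese commodity name this yields chunks of lengths 2, 3 and 4, essentially the candidate lemmas listed in the example above.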
After the server completes the splitting of the obtained corpus according to the above mode, a plurality of candidate lemmas can be obtained. At this time, the server may record the adjacent words of each candidate lemma respectively, and count the recorded adjacent words of each candidate lemma and the occurrence number of each candidate lemma in all the corpora.
The adjacent characters of each candidate lemma include left adjacent characters and right adjacent characters, and one candidate lemma may have a plurality of adjacent characters. For candidate lemmas that can be flexibly collocated with other fixed words or fixed phrases, the adjacent characters may be richer; for example, the candidate lemma "millet" may be collocated with "box" as the fixed phrase "millet box", or with "mobile phone" as the fixed phrase "millet mobile phone". Thus, the more words or phrases a candidate lemma can be collocated with, the more adjacent characters it has.
When recording the adjacent characters of each candidate lemma, the server can search for the lemma in the corpus from which it was split. For example, suppose the corpus is the commodity name "millet mobile phone version" and the candidate lemma is "mobile phone"; then the left adjacent character of the candidate lemma is the character "meter" and the right adjacent character is the character "move". Of course, for lemmas without left or right adjacent characters, such as the candidate lemmas "millet" and "mobile version", the server may mark the missing left or right adjacent character in the record.
When the server counts the number of occurrences of a candidate lemma or an adjacent character in all the corpora, the server can scan all the corpora using the candidate lemma or adjacent character as an index, incrementing the occurrence count by one for each match found. For example, assuming the candidate lemma is "millet", the server may scan the commodity names and user search logs in all the corpora using "millet" as the keyword, incrementing the occurrence count of the lemma by one for each match. Assuming the adjacent character is "m", the server can likewise scan the commodity names and user search logs in all the corpora using "m" as the keyword, incrementing the occurrence count by one for each match.
For the recorded adjacent words of each candidate lemma, the counted occurrence frequency of each candidate lemma and the adjacent words of each candidate lemma in all the linguistic data can be stored in the same text together with the candidate lemma by the server. For example, the server may store the candidate lemmas in text, where each line of the text corresponds to a respective candidate lemma. The server can store the recorded adjacent characters, the counted occurrence frequency of the candidate word element and the candidate word element in the same line, so that the server can conveniently read the candidate word element.
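As a rough sketch of this preprocessing bookkeeping (the patent does not prescribe data structures; the function and variable names are illustrative), the occurrence counts and the left/right adjacent characters of each candidate lemma can be gathered in one pass over the corpus:

```python
from collections import Counter, defaultdict

def preprocess(corpus_lines, candidates):
    """For every candidate lemma, count its occurrences in all corpus
    lines and record the characters immediately to its left and right."""
    counts = Counter()
    left = defaultdict(Counter)   # lemma -> Counter of left-adjacent chars
    right = defaultdict(Counter)  # lemma -> Counter of right-adjacent chars
    for line in corpus_lines:
        for lemma in candidates:
            pos = line.find(lemma)
            while pos != -1:
                counts[lemma] += 1
                if pos > 0:                      # a left neighbor exists
                    left[lemma][line[pos - 1]] += 1
                end = pos + len(lemma)
                if end < len(line):              # a right neighbor exists
                    right[lemma][line[end]] += 1
                pos = line.find(lemma, pos + 1)
    return counts, left, right
```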
The foregoing describes a preprocessing process of the server on the split candidate lemmas.
In this disclosure, after the server finishes the preprocessing process of the split candidate lemmas, the server may calculate the aggregation and the degree of freedom of each candidate lemma based on the counted parameters such as the number of occurrences of the candidate lemma, the adjacent words of the candidate lemma, and the number of occurrences of the adjacent words.
The degree of aggregation characterizes the probability of a candidate lemma being a fixed word or fixed phrase. This indicator can be used to evaluate whether a lemma is a common fixed collocation, i.e., whether the characters forming the lemma were pieced together by accident. For example, if character A is unrelated to character B, then the probability that they accidentally form the word AB is P(A) * P(B), where P(A) represents the probability that A appears in the corpus. If P(AB) is calculated to be much greater than P(A) * P(B), then it can be assumed that the word AB is not a random combination of the characters A and B, but a fixed word collocation.
Based on this, in one embodiment shown, the server, when calculating the degree of aggregation of the candidate lemmas, may perform the calculation by the following degree of aggregation calculation formula:
S(A1A2…An) = min{ P(A1A2…An) / (P(A1) * P(A2…An)), P(A1A2…An) / (P(A1A2) * P(A3…An)), …, P(A1A2…An) / (P(A1…An-1) * P(An)) }
In the above formula, S(A1A2…An) represents the degree of aggregation of the candidate lemma [A1A2…An], and P(A1A2…An) represents the probability of occurrence of the candidate lemma [A1A2…An] in all corpora. Accordingly, P(A1) and P(A2…An) respectively represent the probabilities of occurrence, in all corpora, of the character [A1] and the character string [A2…An] split from the candidate lemma [A1A2…An]; by analogy, P(A1A2) and P(A3…An) respectively represent the probabilities of occurrence, in all corpora, of the character string [A1A2] and the character string [A3…An] split from the candidate lemma [A1A2…An].
In the present disclosure, the occurrence probability may be expressed as the ratio of the number of occurrences in all corpora to the total length of the corpora. For example, when calculating the occurrence probability of a candidate lemma, the server may use the ratio of the number of occurrences of the candidate lemma, obtained by statistics in the corpus preprocessing stage, to the total length of the corpora; when calculating the occurrence probability of an adjacent character, the server may likewise use the ratio of the number of occurrences of the adjacent character counted in the preprocessing stage to the total length of the corpora. The value of n in the above formula represents the number of characters or letters of the candidate lemma; for example, when the candidate lemma is "millet", the value of n is 2.
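Expressed as code, the occurrence probability used throughout is just this ratio. A minimal helper, assuming `counts` is the occurrence table from the preprocessing sketch and that "total length" means the total character count of all corpora:

```python
def make_prob(counts, total_corpus_length):
    """Occurrence probability: number of occurrences in all corpora
    divided by the total length of the corpora."""
    def prob(s):
        return counts.get(s, 0) / total_corpus_length
    return prob
```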
When the server calculates the aggregation of the candidate lemmas by using the above formula, the server may sequentially select all the candidate lemmas obtained by splitting as target candidate lemmas, and then calculate the aggregation of all the candidate lemmas obtained after splitting by using the above formula for each selected target candidate lemma.
When the server calculates the degree of aggregation of a target candidate lemma, the server can first further split the candidate lemma. Taking the candidate lemma [A1A2A3A4] as an example, the server may split it into the 6 combinations [A1] and [A2A3A4], [A1A2] and [A3A4], and [A1A2A3] and [A4].
When the splitting is complete, the server can compute the probabilities of occurrence of [A1], [A2A3A4], [A1A2], [A3A4], [A1A2A3] and [A4] in all corpora. The specific process of calculating the occurrence probabilities of [A1], [A2A3A4], [A1A2], [A3A4], [A1A2A3] and [A4] split from the candidate lemma [A1A2A3A4] is the same as the specific process of calculating the occurrence probability of the candidate lemma [A1A2A3A4] itself. For example, when calculating the occurrence probability P(A1) of [A1], the server can count the occurrences of [A1] and compute the ratio of the counted occurrences to the total length of the corpora.
After the server has calculated the occurrence probabilities of the 6 combinations [A1] and [A2A3A4], [A1A2] and [A3A4], and [A1A2A3] and [A4], the calculated occurrence probabilities can be substituted into the above formula to obtain the degree of aggregation of the candidate lemma [A1A2A3A4].
For example, assuming the candidate lemma is "millet cell phone", the server may split it into the 6 combinations "small", "rice cell phone", "millet", "cell phone", "millet hand" and "machine", then calculate the occurrence probabilities of these six combinations in all corpora, and substitute the calculated occurrence probabilities into the above formula. When the calculated occurrence probabilities of the 6 combinations are substituted into the formula, 3 groups of values are obtained.
The value of group 1 is the ratio of the occurrence probability of the candidate lemma "millet cell phone" to the product of the occurrence probabilities of "small" and "rice cell phone".
The value of group 2 is the ratio of the occurrence probability of the candidate lemma "millet cell phone" to the product of the occurrence probabilities of "millet" and "cell phone".
The value of group 3 is the ratio of the occurrence probability of the candidate lemma "millet cell phone" to the product of the occurrence probabilities of "millet hand" and "machine".
Based on the above formula, the minimum of the 3 groups of values is taken as the degree of aggregation of the candidate lemma "millet cell phone".
It should be noted that, in implementation, the maximum value among the 3 groups of values may also be used as the degree of aggregation of the candidate lemma "millet cell phone"; this is not specifically limited in the present disclosure.
Therefore, in another embodiment shown, the above formula for calculating the degree of aggregation may also be expressed as the following formula, and the specific calculation process is not described again:
S(A1A2…An) = max{ P(A1A2…An) / (P(A1) * P(A2…An)), P(A1A2…An) / (P(A1A2) * P(A3…An)), …, P(A1A2…An) / (P(A1…An-1) * P(An)) }
the above describes the calculation process of the degree of aggregation of the candidate lemmas, and in this way, the degree of aggregation of each candidate lemma can be calculated.
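A compact sketch of this aggregation computation, assuming `prob` is the occurrence-probability helper from the sketch above (an illustrative function, not named in the patent); substituting max for min gives the alternative formula:

```python
def aggregation_degree(lemma, prob, use_max=False):
    """min (or max), over all binary split points, of
    P(lemma) / (P(prefix) * P(suffix))."""
    p_whole = prob(lemma)
    ratios = []
    for i in range(1, len(lemma)):        # every prefix/suffix split point
        denom = prob(lemma[:i]) * prob(lemma[i:])
        if denom > 0:
            ratios.append(p_whole / denom)
    if not ratios:
        return 0.0
    return max(ratios) if use_max else min(ratios)
```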
In the present disclosure, after the server calculates the degree of aggregation of each candidate lemma, the server may calculate the degree of freedom of each candidate lemma based on the parameters of the adjacent word of the candidate lemma that have been counted, the number of occurrences of the adjacent word, and the like.
The degree of freedom characterizes the flexibility with which a candidate lemma is collocated with fixed words or fixed phrases. This index measures whether a candidate lemma can flexibly appear in various different contexts, i.e., whether its left and right collocations are flexible. If the degree of freedom value of a candidate lemma is low, the lemma may be only a part of some fixed word collocation; the higher the degree of freedom value, the more fixed words or fixed phrases can be collocated with the candidate lemma. For example, if the candidate lemma "millet" appears 363 times in all commodity names and 17 different words or phrases can be collocated on its right side, the degree of freedom of the word "millet" can be considered high.
In this disclosure, when calculating the degrees of freedom of the candidate lemmas, the server may sequentially select all the candidate lemmas obtained by splitting as target candidate lemmas, and then calculate the degrees of freedom for each selected target candidate lemma, so that the degrees of freedom of all the candidate lemmas obtained after splitting may be calculated.
When the server calculates the degree of freedom of the target candidate lemma, the server may first read the neighboring words of the target candidate lemma recorded in the corpus preprocessing stage, at this time, the target candidate lemma may have a plurality of neighboring words, and the server may calculate the information entropy of each neighboring word of the target candidate lemma respectively.
When the information entropy of the adjacent word is obtained, the information entropy can be obtained by the following calculation formula of the information entropy:
I(ω) = -Σ P * log(P)
In the above formula, I(ω) represents the calculated information entropy, and P represents the probability of occurrence of an adjacent character of the target candidate lemma in all the corpora. The base of the log function in the above formula is not specifically limited in this disclosure; it may be 2 or another value, and those skilled in the art may refer to descriptions in the prior art.
After the server calculates the information entropies of all the adjacent characters of the target candidate lemma through the above formula, the information entropies can be added together, and the sum is used as the degree of freedom of the target candidate lemma. It can be seen that the more adjacent characters a target candidate lemma has, the higher its finally calculated degree of freedom, and the more fixed words or fixed phrases can be collocated with it.
The above describes the calculation process of the degrees of freedom of the candidate lemmas, and in this way, the degree of freedom of each candidate lemma can be calculated.
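The degree-of-freedom computation can be sketched as below. One point the text leaves open is whether P is the neighbor's probability within that lemma's own neighbor distribution or its probability over the whole corpus; this sketch uses the per-lemma distribution, the common choice for neighbor entropy, and that reading should be treated as an assumption.

```python
import math

def degree_of_freedom(lemma, left, right):
    """Sum the information entropies I = -sum(P * log P) of the lemma's
    left- and right-adjacent character distributions."""
    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log(c / total)
                    for c in counter.values())
    return entropy(left[lemma]) + entropy(right[lemma])
```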
In the present disclosure, after the server has calculated the degree of aggregation and the degree of freedom of every split candidate lemma, the server may weight the degree of aggregation and the degree of freedom of each candidate lemma to obtain weighted sums, and then extract new candidate words or candidate phrases from all the candidate lemmas according to the calculated weighted sums.
In an embodiment shown, when the server performs weighting calculation on the aggregation degree and the degree of freedom of each candidate lemma respectively, the weighting calculation may be performed by the following weighting formula:
F(ω) = S(ω)*ω1 + I(ω)*ω2
where F(ω) represents the weighted sum; S(ω) represents the degree of aggregation of the candidate lemma; ω1 represents the weight ratio preset for the degree of aggregation; I(ω) represents the degree of freedom of the candidate lemma; and ω2 represents the weight ratio preset for the degree of freedom. The specific values of ω1 and ω2 in the above formula can be customized or adjusted by the user according to actual requirements.
In the present disclosure, after the server calculates the weighted sums of all candidate lemmas through the above weighting formula, the calculated weighted sums can be sorted by value, and the server may then extract, based on the sorting, the m candidate lemmas with the highest weighted sums as new candidate words or new candidate phrases. The value of m may be set by the user according to actual requirements; for example, the 10 candidate lemmas with the highest weighted sums may be extracted.
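Putting the two-term weighting and the top-m selection together (a sketch: S and I map each candidate lemma to its degree of aggregation and degree of freedom, and the weights and m are user-set values given illustrative defaults here):

```python
def extract_new_words(S, I, w1=0.5, w2=0.5, m=10):
    """F = S*w1 + I*w2 per candidate lemma; return the m lemmas with
    the highest weighted sums as new candidate words or phrases."""
    f = {lemma: S[lemma] * w1 + I[lemma] * w2 for lemma in S}
    return sorted(f, key=f.get, reverse=True)[:m]
```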
In the above embodiment, weighting with the two indexes of degree of aggregation and degree of freedom can better measure the inseparability and collocation flexibility of candidate lemmas. However, the commodity names and user search logs of an e-commerce website contain many special brand names and proper nouns, so using only the degree of aggregation and degree of freedom may miss words that appear relatively frequently but are easily split or have a low degree of freedom; such words may well be proper nouns of the e-commerce website.
Therefore, in the present disclosure, to make the extraction of candidate words or candidate phrases more accurate, when performing the above weighted calculation, in addition to the two indexes of degree of aggregation and degree of freedom, the occurrence probability of the candidate lemma may be introduced into the weighted calculation as a correction parameter. This avoids missing words that appear relatively frequently but are easily split or have a low degree of freedom when extracting new words from the weighting result.
In another embodiment shown in the present disclosure, when the server performs weighting calculation on the aggregation degree and the degree of freedom of each candidate lemma, the weighting calculation may be performed by the following weighting formula:
F(ω) = S(ω)*ω1 + I(ω)*ω2 + C(ω)*ω3
In the above formula, F(ω) represents the weighted sum; S(ω) represents the degree of aggregation of the candidate lemma; ω1 represents the weight ratio preset for the degree of aggregation; I(ω) represents the degree of freedom of the candidate lemma; ω2 represents the weight ratio preset for the degree of freedom; C(ω) represents the correction parameter; and ω3 represents the weight ratio preset for the correction parameter.
In the present disclosure, when the occurrence probability of the candidate lemma is used as the correction parameter, C(ω) in the above formula represents the occurrence probability of the candidate lemma, and ω3 represents the weight ratio set by the user for that occurrence probability.
It should be noted that the weight ratios of the degree of aggregation, the degree of freedom, and the occurrence probability of the candidate lemma in the above formula can all be set and adjusted by the user according to actual needs. In practical application, the total length of the corpus used by the server may also influence the result of the weighted calculation; for example, when the total length of the corpus is insufficient, the calculated degree of aggregation and degree of freedom of the candidate lemmas may be inaccurate. In such a case, using the occurrence probability of the candidate lemma as the correction parameter and appropriately adjusting its weight ratio can correct the finally calculated weighted sum.
In this disclosure, when adjusting the weight ratio of the occurrence probability of the candidate lemma, the server may compare the total length of the corpus with a preset threshold to determine whether the current total length is greater than the threshold. If the total length of the corpus is greater than the preset threshold, the corpus is sufficient and the degree of aggregation and degree of freedom calculated by the server are generally accurate, so the server may appropriately reduce the weight ratio of the occurrence probability of the candidate lemma by a preset amplitude. Likewise, if the total length of the corpus is less than the preset threshold, the corpus may be insufficient and the calculated degree of aggregation and degree of freedom are usually inaccurate, so the server can appropriately increase the weight ratio of the occurrence probability of the candidate lemma by a preset amplitude. This helps offset the inaccuracy in the degree of aggregation and degree of freedom caused by the insufficient corpus, and corrects the finally calculated weighted sum to the maximum extent.
The preset amplitude may be set according to how far the total corpus length exceeds or falls below the preset threshold: the larger the deviation, the larger the adjustment amplitude, and conversely, the smaller the deviation, the smaller the adjustment amplitude.
For example, the server may divide the amount by which the total corpus length deviates from the threshold into a plurality of levels and set a corresponding adjustment amplitude for each level, where a larger deviation corresponds to a larger adjustment amplitude. When determining the adjustment amplitude, the deviation of the total corpus length from the preset threshold is matched against these levels, and if a level is matched, the weight ratio of the occurrence probability of the candidate lemma is adjusted by the amplitude corresponding to that level.
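A sketch of the three-term variant with the threshold-based adjustment of ω3. The level boundaries and step sizes below are purely illustrative assumptions; the patent fixes only the direction of the adjustment (raise ω3 below the threshold, lower it above) and that larger deviations warrant larger amplitudes.

```python
def adjusted_w3(w3, corpus_length, threshold,
                levels=((2.0, 0.10), (1.5, 0.05), (1.0, 0.02))):
    """Increase the correction-parameter weight when the corpus is
    shorter than the threshold, decrease it when longer; the step
    grows with the deviation (levels are ordered largest first)."""
    if corpus_length == threshold:
        return w3
    deviation = max(corpus_length, threshold) / min(corpus_length, threshold)
    for bound, step in levels:
        if deviation >= bound:
            return w3 - step if corpus_length > threshold else w3 + step
    return w3

def weighted_sum(s, i, c, w1, w2, w3):
    """F = S*w1 + I*w2 + C*w3, where C is the lemma's occurrence
    probability used as the correction parameter."""
    return s * w1 + i * w2 + c * w3
```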
After the server calculates the weighted sum of all candidate lemmas through the weighting formula, the server can still rank the weighted sum of all the candidate lemmas obtained through calculation according to the numerical value, and then the server can extract m candidate lemmas with the highest weighted sum of all the candidate lemmas as new candidate words or new candidate phrases based on the ranking, and the specific process is not repeated.
In the disclosure, after the server extracts new candidate words or candidate phrases from all the candidate lemmas, the extracted candidate words or candidate phrases may be output in a manual review interface, so that a reviewer performs manual review.
In the manual review page, each output candidate word or candidate phrase can be set to a "pass" or "reject" review state, respectively. When an output candidate word or candidate phrase is approved by the reviewer, it can be set to the "pass" state. In addition, the manual review page may provide a user option for setting an import directory for the candidate words or candidate phrases that have passed review, and the reviewer may set the import directory by operating this user option. For example, for the search server, approved candidate words or candidate phrases may be imported into the word segmentation word list or the spelling suggestions of the search engine, so the user option in the manual review interface may provide two import directories: word segmentation word list and spelling suggestions.
For the candidate words or candidate phrases whose review is complete, the server can periodically read the candidate words or candidate phrases set to the "pass" state in the manual review interface, read the import directory set by the reviewer, and import the read candidate words or candidate phrases into the corresponding directory. In this way, when a user later searches with keywords containing an imported candidate word or candidate phrase, the server can correctly segment the keywords through the word segmentation word list and output matching search results for the user; and while the user is entering keywords, the server can also output the imported candidate words or candidate phrases to the user as spelling suggestions, thereby improving the user's search experience.
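As a final illustrative sketch (the file formats and the shape of the review records are assumptions, not from the patent), the periodic import step might read the reviewed items and append the approved ones to the directory the reviewer chose:

```python
def import_approved(reviewed_items, segmentation_vocab_path,
                    spelling_suggestions_path):
    """Append each candidate marked "pass" to its import directory:
    the word segmentation word list or the spelling suggestions.
    `reviewed_items` is an iterable of (candidate, state, target)
    tuples, e.g. ("millet mobile phone", "pass", "segmentation")."""
    with open(segmentation_vocab_path, "a", encoding="utf-8") as seg, \
         open(spelling_suggestions_path, "a", encoding="utf-8") as sug:
        for candidate, state, target in reviewed_items:
            if state != "pass":
                continue
            out = seg if target == "segmentation" else sug
            out.write(candidate + "\n")
```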
Therefore, through the method, the server can continuously acquire the linguistic data, and accurately extract new candidate words or candidate phrases from the linguistic data to enrich the word segmentation word list and spelling suggestions of the search engine, so that the search experience of a user can be continuously optimized, and the accuracy of a search result is improved.
In the above embodiments, the technical solution of the present disclosure is described by taking an application scenario of an e-commerce website as an example, and in practical application, the technical solution of the present disclosure may also be applied to other scenarios. When a person skilled in the art puts the above technical solution into practice in other scenarios, reference may be made to the descriptions in the above embodiments of the present disclosure, and details are not described in the present disclosure.
In the above embodiment, the degree of aggregation of a plurality of candidate lemmas is calculated, where the degree of aggregation characterizes the probability of a candidate lemma being a fixed word or fixed phrase; the degrees of freedom of the candidate lemmas are calculated, where the degree of freedom characterizes the flexibility with which a candidate lemma is collocated with fixed words or fixed phrases, a higher value indicating that more fixed words or fixed phrases can be collocated with the candidate lemma; the calculated degree of aggregation and degree of freedom of each candidate lemma are then weighted to obtain a weighted sum; and candidate words or candidate phrases are extracted from the plurality of candidate lemmas based on the calculated weighted sums.
As shown in fig. 2, fig. 2 is another new word extraction method according to an exemplary embodiment, where the new word extraction method is used in a server and includes the following steps:
in step 201, corpora are obtained.
In step 202, the obtained corpus is subjected to lemma splitting to obtain the multiple candidate lemmas.
In step 203, the occurrence frequency of the candidate lemmas in all corpora is counted.
In step 204, the candidate lemmas are sequentially selected as target candidate lemmas, and the degree of aggregation of the selected target candidate lemmas is calculated according to an aggregation calculation formula.
Wherein, the calculation formula of the degree of aggregation is as follows:
S(A1A2…An) = min{ P(A1A2…An) / (P(A1) * P(A2…An)), P(A1A2…An) / (P(A1A2) * P(A3…An)), …, P(A1A2…An) / (P(A1…An-1) * P(An)) }
or,
S(A1A2…An) = max{ P(A1A2…An) / (P(A1) * P(A2…An)), P(A1A2…An) / (P(A1A2) * P(A3…An)), …, P(A1A2…An) / (P(A1…An-1) * P(An)) }
where S(A1A2…An) represents the degree of aggregation of the candidate lemma [A1A2…An]; P(A1A2…An) represents the probability of occurrence of the candidate lemma [A1A2…An] in all corpora; P(A1) and P(A2…An) respectively represent the probabilities of occurrence, in all corpora, of the character [A1] and the character string [A2…An] split from the candidate lemma [A1A2…An]; the occurrence probability is the ratio of the number of occurrences in all corpora to the total length of the corpora; and the value of n represents the number of characters or letters constituting the candidate lemma.
In step 205, the adjacent words of the candidate lemmas are recorded, and the occurrence frequency of the adjacent words of the candidate lemmas in all the corpora is counted.
In step 206, the candidate lemmas are sequentially selected as target candidate lemmas, and the occurrence probability of the adjacent words of the selected target candidate lemmas in all the corpora is calculated.
In step 207, the information entropy of the adjacent word is calculated based on the calculated occurrence probability.
In step 208, the information entropies of all the adjacent words of the target candidate lemma calculated are added to obtain the degree of freedom of the target candidate lemma.
In step 209, the aggregation and the degree of freedom of each candidate lemma in the plurality of candidate lemmas are weighted according to a preset weighting formula to obtain a weighted sum.
The preset weighting formula is as follows:
$$F(\omega) = S(\omega) \cdot \omega_1 + I(\omega) \cdot \omega_2$$

or,

$$F(\omega) = S(\omega) \cdot \omega_1 + I(\omega) \cdot \omega_2 + C(\omega) \cdot \omega_3$$

wherein $F(\omega)$ represents the weighted sum; $S(\omega)$ represents the degree of aggregation of the candidate lemma; $\omega_1$ represents the weight ratio preset for the degree of aggregation; $I(\omega)$ represents the degree of freedom of the candidate lemma; $\omega_2$ represents the weight ratio preset for the degree of freedom; $C(\omega)$ represents a correction parameter; $\omega_3$ represents the weight ratio preset for the correction parameter.
In step 210, the weighted sum obtained by calculation is sorted according to the magnitude of the value, and m candidate lemmas with the highest weighted sum among the candidate lemmas are extracted as candidate words or candidate phrases based on the sorting.
In step 211, the extracted candidate words or candidate phrases are output in a manual review interface, and the candidate words or candidate phrases that have been reviewed in the manual review interface are imported into a spelling suggestion or word segmentation vocabulary of a search engine.
The server side can comprise a server, a server cluster or a cloud platform constructed based on the server cluster, wherein the server provides search services for users. For example, in an application scenario of an e-commerce website, the server may be a server providing a commodity search service to users. The server can split the keywords input by the user through a word segmentation word list in the search engine, and searches and outputs matched commodities for the user in the commodities to be sold according to the split lemmas.
The technical solution of the present disclosure is described in detail below by taking as an example an application scenario of an e-commerce website in which the server provides a commodity search service for users.
In this e-commerce application scenario, a candidate lemma may be a candidate word or candidate phrase obtained by the server performing lemma splitting on the acquired corpus.
The corpus may include commodity names and user search logs. When acquiring the corpus, the server can periodically extract the names of newly listed commodities to be sold and the users' search logs.
For commodity names, the server may periodically extract the commodity names under different commodity classifications and write them, grouped by classification, into different texts in a certain order, for example with each line of a text corresponding to one commodity name. For the user search logs, the server can record the search term entered in each search operation, merge duplicate search terms, and then write them into a text in a certain order; for example, there may still be one search term per line in the text.
In the present disclosure, the obtained corpus may be preprocessed by the server. When preprocessing the corpus, the server can split the obtained corpus into a plurality of candidate lemmas, and different splitting modes may be adopted when performing lemma splitting on the corpus.
In an embodiment shown in the present disclosure, the server may set a maximum splitting length and a minimum splitting length, both of which may be customized by a user; the maximum splitting length is the maximum length of a split lemma, and the minimum splitting length is the minimum length of a split lemma. The minimum splitting length can be set to 2, which guarantees that no single characters remain among the finally split lemmas.
Setting the minimum and maximum splitting lengths defines a length range for lemma splitting; when splitting, the server can split the corpus according to each splitting-length value within this range, so as to split the corpus into a plurality of candidate lemmas.
For example, assume that the corpus acquired by the server includes the commodity name "millet mobile phone version", that the maximum splitting length set by the user is 4, and that the minimum splitting length is 2. When performing lemma splitting on this commodity name, the server may first split it according to the minimum splitting length 2 into candidate lemmas such as "millet" and "mobile phone", where a remaining lemma "version" is discarded because its length is less than 2. After splitting according to the splitting length 2, the commodity name can be further split according to the splitting length 3 into lemmas such as "millet hand". After splitting according to the splitting length 3, the commodity name can be further split according to the maximum splitting length 4 into the two lemmas "millet mobile phone" and "mobile version". At this point the splitting of the commodity name is complete, and the finally split candidate lemmas include 6 candidate lemmas such as "millet", "mobile phone", "millet hand", "millet mobile phone" and "mobile version".
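To make the splitting procedure above concrete, here is a minimal Python sketch (not from the patent; the function name, variable names and example string are invented). It enumerates every substring whose length falls within the user-set range, a sliding-window reading of the scheme described above; the patent itself does not pin down whether windows overlap, so this is one plausible interpretation.

```python
def split_lemmas(text: str, min_len: int = 2, max_len: int = 4) -> list[str]:
    """Enumerate candidate lemmas: every substring of `text` whose length
    lies in [min_len, max_len]; anything shorter than min_len (e.g. a
    single character) never becomes a candidate."""
    candidates = []
    for length in range(min_len, max_len + 1):
        for start in range(len(text) - length + 1):
            candidates.append(text[start:start + length])
    return candidates

# split_lemmas("ABCDE") -> ['AB', 'BC', 'CD', 'DE',
#                           'ABC', 'BCD', 'CDE', 'ABCD', 'BCDE']
```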
After the server completes the splitting of the obtained corpus in the above manner, a plurality of candidate lemmas are obtained. At this point, the server may record the adjacent characters of each candidate lemma and count the occurrence numbers, in all corpora, of each candidate lemma and of its recorded adjacent characters.
The adjacent characters of each candidate lemma include left adjacent characters and right adjacent characters, and one candidate lemma may have a plurality of them. For a candidate lemma that can be flexibly collocated with other fixed words or fixed phrases, the adjacent characters tend to be richer; for example, the candidate lemma "millet" may be collocated with "box" to form the fixed phrase "millet box", or with "mobile phone" to form the fixed phrase "millet mobile phone". Thus, the more fixed words or phrases a candidate lemma can be collocated with, the richer its adjacent characters.
When recording the adjacent characters of each candidate lemma, the server can look back into the corpus to which the candidate lemma belongs. For example, suppose the corpus is the commodity name "millet mobile phone version" and the candidate lemma is "mobile phone"; then the left adjacent character of the candidate lemma is "meter" and the right adjacent character is "move". Of course, for lemmas without a left or right adjacent character, such as the candidate lemmas "millet" and "mobile version", the server may mark in the record that no adjacent character exists on that side.
When counting the occurrence number of a candidate lemma or adjacent character in all corpora, the server can scan all corpora using the candidate lemma or adjacent character as an index, incrementing the count by one for each occurrence scanned. For example, if the candidate lemma is "millet", the server may scan the commodity names and user search logs in all corpora with "millet" as the keyword, incrementing the lemma's occurrence count by one each time it is found; if the adjacent character is "meter", the server can likewise scan all corpora with "meter" as the keyword and increment its occurrence count by one each time it is found.
For the recorded adjacent characters of each candidate lemma, the server can store the counted occurrence numbers of each candidate lemma and of its adjacent characters in the same text as the candidate lemma itself. For example, the server may store the candidate lemmas in a text with each line corresponding to one candidate lemma; the recorded adjacent characters and the counted occurrence number can be stored on the same line as the candidate lemma, which makes it convenient for the server to read.
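As an illustration of this preprocessing pass, the following sketch (illustrative names; it assumes the corpus is simply a list of strings, one commodity name or search term per entry) counts each candidate lemma and records its adjacent characters:

```python
from collections import Counter, defaultdict

def preprocess(corpus, min_len=2, max_len=4):
    """Count each candidate lemma's occurrences across the corpus and
    record its left/right adjacent characters; a lemma at a string
    boundary simply has no neighbour recorded on that side."""
    lemma_counts = Counter()
    left_chars = defaultdict(Counter)   # lemma -> left-neighbour counts
    right_chars = defaultdict(Counter)  # lemma -> right-neighbour counts
    total_length = 0
    for line in corpus:
        total_length += len(line)
        for length in range(min_len, max_len + 1):
            for i in range(len(line) - length + 1):
                lemma = line[i:i + length]
                lemma_counts[lemma] += 1
                if i > 0:
                    left_chars[lemma][line[i - 1]] += 1
                if i + length < len(line):
                    right_chars[lemma][line[i + length]] += 1
    return lemma_counts, left_chars, right_chars, total_length
```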
The foregoing describes a preprocessing process of the server on the split candidate lemmas.
In this disclosure, after the server finishes the preprocessing process of the split candidate lemmas, the server may calculate the aggregation and the degree of freedom of each candidate lemma based on the counted parameters such as the number of occurrences of the candidate lemma, the adjacent words of the candidate lemma, and the number of occurrences of the adjacent words.
The degree of aggregation characterizes the probability that a candidate lemma is a fixed word or fixed phrase. This indicator can be used to evaluate whether a lemma is a common fixed collocation, that is, whether the characters forming the lemma are merely pieced together by accident. For example, if character A is unrelated to character B, the probability that they accidentally form the word AB is P(A) × P(B), where P(A) represents the probability that A appears in the corpus. If the computed P(AB) is much greater than P(A) × P(B), it can be concluded that the word AB is not a random juxtaposition of characters A and B, but a fixed word collocation.
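To make this concrete with invented numbers: if P(A) = 0.01 and P(B) = 0.02, independence would predict P(AB) ≈ 0.01 × 0.02 = 0.0002; if the observed P(AB) is 0.005, the ratio P(AB)/(P(A) × P(B)) = 25, far above 1, which suggests that AB is a fixed collocation rather than an accidental adjacency.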
Based on this, in one embodiment shown, the server, when calculating the degree of aggregation of the candidate lemmas, may perform the calculation by the following degree of aggregation calculation formula:
$$S(A_1 A_2 \ldots A_n) = \min\left\{ \frac{P(A_1 A_2 \ldots A_n)}{P(A_1) \cdot P(A_2 \ldots A_n)},\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 A_2) \cdot P(A_3 \ldots A_n)},\; \ldots,\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 \ldots A_{n-1}) \cdot P(A_n)} \right\}$$

In the above formula, $S(A_1 A_2 \ldots A_n)$ represents the degree of aggregation of the candidate lemma $[A_1 A_2 \ldots A_n]$, and $P(A_1 A_2 \ldots A_n)$ represents the probability that the candidate lemma occurs in all corpora. Accordingly, $P(A_1)$ and $P(A_2 \ldots A_n)$ respectively represent the probabilities that the character $[A_1]$ and the character string $[A_2 \ldots A_n]$ split from the candidate lemma occur in all corpora; by analogy, $P(A_1 A_2)$ and $P(A_3 \ldots A_n)$ respectively represent the probabilities that the character strings $[A_1 A_2]$ and $[A_3 \ldots A_n]$ split from the candidate lemma occur in all corpora.
In the present disclosure, an occurrence probability may be expressed as the ratio of the number of occurrences in all corpora to the total corpus length. For example, when calculating the occurrence probability of a candidate lemma, the server may use the ratio of the candidate lemma's occurrence number, counted in the corpus preprocessing stage, to the total corpus length; when calculating the occurrence probability of an adjacent character, the server may likewise use the ratio of the adjacent character's counted occurrence number to the total corpus length. The value of $n$ in the above formula represents the number of characters or letters of the candidate lemma; for example, if the candidate lemma is "millet", the value of $n$ is 2.
When the server calculates the aggregation of the candidate lemmas by using the above formula, the server may sequentially select all the candidate lemmas obtained by splitting as target candidate lemmas, and then calculate the aggregation of all the candidate lemmas obtained after splitting by using the above formula for each selected target candidate lemma.
When calculating the degree of aggregation of a target candidate lemma, the server can first continue to split the candidate lemma. Taking the candidate lemma $[A_1 A_2 A_3 A_4]$ as an example, the server may split it into three combinations: $[A_1]$ and $[A_2 A_3 A_4]$; $[A_1 A_2]$ and $[A_3 A_4]$; and $[A_1 A_2 A_3]$ and $[A_4]$, yielding 6 segments in total.
When the splitting is complete, the server can compute the occurrence probabilities of $[A_1]$, $[A_2 A_3 A_4]$, $[A_1 A_2]$, $[A_3 A_4]$, $[A_1 A_2 A_3]$ and $[A_4]$ in all corpora. The process of calculating the occurrence probability of each segment split out of the candidate lemma $[A_1 A_2 A_3 A_4]$ is the same as the process of calculating the occurrence probability of the candidate lemma itself; for example, to obtain the occurrence probability $P(A_1)$ of $[A_1]$, the server can count the occurrences of $[A_1]$ and compute the ratio of the counted occurrences to the total corpus length.
After the server has calculated the occurrence probabilities of the 6 segments in the splits $[A_1]$ and $[A_2 A_3 A_4]$, $[A_1 A_2]$ and $[A_3 A_4]$, and $[A_1 A_2 A_3]$ and $[A_4]$, the calculated occurrence probabilities can be substituted into the above formula to obtain the degree of aggregation of the candidate lemma $[A_1 A_2 A_3 A_4]$.
For example, assuming that the candidate lemma is "millet mobile phone", the server may split the candidate lemma into the 6 segments "small", "rice mobile phone", "millet", "mobile phone", "millet hand" and "machine", then calculate the occurrence probabilities of these six segments in all corpora and substitute the calculated occurrence probabilities into the above formula. When the calculated occurrence probabilities of the 6 segments are substituted into the formula, 3 groups of values are obtained.
The 1st value is the ratio of the occurrence probability of the candidate lemma "millet mobile phone" to the product of the occurrence probabilities of "small" and "rice mobile phone".
The 2nd value is the ratio of the occurrence probability of the candidate lemma "millet mobile phone" to the product of the occurrence probabilities of "millet" and "mobile phone".
The 3rd value is the ratio of the occurrence probability of the candidate lemma "millet mobile phone" to the product of the occurrence probabilities of "millet hand" and "machine".
The minimum value of the 3 sets of values can be taken as the degree of aggregation of the candidate lemma "millet mobile phone" based on the above formula.
Note that, in implementation, the maximum value among the 3 groups of values may instead be used as the degree of aggregation of the candidate lemma "millet mobile phone"; this is not particularly limited in the present disclosure.
Therefore, in another embodiment shown, the above formula for calculating the degree of aggregation may also be expressed as the following formula, the specific calculation process not being repeated here:

$$S(A_1 A_2 \ldots A_n) = \max\left\{ \frac{P(A_1 A_2 \ldots A_n)}{P(A_1) \cdot P(A_2 \ldots A_n)},\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 A_2) \cdot P(A_3 \ldots A_n)},\; \ldots,\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 \ldots A_{n-1}) \cdot P(A_n)} \right\}$$
the above describes the calculation process of the degree of aggregation of the candidate lemmas, and in this way, the degree of aggregation of each candidate lemma can be calculated.
In the present disclosure, after the server calculates the degree of aggregation of each candidate lemma, the server may calculate the degree of freedom of each candidate lemma based on the parameters of the adjacent word of the candidate lemma that have been counted, the number of occurrences of the adjacent word, and the like.
The degree of freedom characterizes the flexibility with which a candidate lemma collocates with fixed words or fixed phrases. This index measures whether a candidate lemma can appear flexibly in a variety of different contexts, that is, whether its left and right collocations are flexible. If the degree-of-freedom value of a candidate lemma is low, the lemma is probably only part of some fixed word collocation; the higher the degree-of-freedom value, the more fixed words or fixed phrases can be collocated with the candidate lemma. For example, if the candidate lemma "millet" appears 363 times across all commodity names and 17 different words or phrases can be collocated on its right side, the degree of freedom of "millet" can be considered high.
In this disclosure, when calculating the degrees of freedom of the candidate lemmas, the server may sequentially select all the candidate lemmas obtained by splitting as target candidate lemmas, and then calculate the degrees of freedom for each selected target candidate lemma, so that the degrees of freedom of all the candidate lemmas obtained after splitting may be calculated.
When the server calculates the degree of freedom of the target candidate lemma, the server may first read the neighboring words of the target candidate lemma recorded in the corpus preprocessing stage, at this time, the target candidate lemma may have a plurality of neighboring words, and the server may calculate the information entropy of each neighboring word of the target candidate lemma respectively.
The information entropy of an adjacent character can be obtained by the following formula:

$$I(\omega) = -\sum P \cdot \log(P)$$
in the above formula, I (ω) represents the calculated information entropy. P represents the probability of occurrence of adjacent words of the target candidate lemma in all corpora. The base of the log function in the above formula is not particularly limited in this disclosure, and may be 2 or other values, and those skilled in the art may refer to the description in the prior art.
After the server calculates the information entropies of all the adjacent characters of the target candidate lemma through the above formula, it can add them up and use the sum as the degree of freedom of the target candidate lemma. It follows that the more adjacent characters a target candidate lemma has, the higher its finally calculated degree of freedom, and the more fixed words or fixed phrases can be collocated with it.
The above describes the calculation process of the degrees of freedom of the candidate lemmas, and in this way, the degree of freedom of each candidate lemma can be calculated.
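A sketch of the degree-of-freedom step under the same assumptions (illustrative names; `left_chars`/`right_chars` are the neighbour records from the preprocessing sketch and `char_counts` is an assumed corpus-wide tally of single characters; the text does not specify whether a character seen on both sides contributes once or twice, so this sketch lets each side contribute its own term):

```python
import math

def degree_of_freedom(lemma, left_chars, right_chars,
                      char_counts, total_length):
    """Sum the information entropy terms -P * log2(P) over the lemma's
    recorded left and right adjacent characters, where P is that
    character's corpus-wide occurrence probability (count divided by
    the total corpus length), as described in the text above."""
    dof = 0.0
    for side in (left_chars[lemma], right_chars[lemma]):
        for ch in side:
            p = char_counts[ch] / total_length
            if p > 0:
                dof += -p * math.log2(p)
    return dof
```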
In the present disclosure, after the server calculates the degree of aggregation and the degree of freedom of each split candidate lemma, it may perform a weighted calculation on the two to obtain a weighted sum for each lemma, and then extract new candidate words or candidate phrases from all the candidate lemmas according to the calculated weighted sums.
In an embodiment shown, when the server performs weighting calculation on the aggregation degree and the degree of freedom of each candidate lemma respectively, the weighting calculation may be performed by the following weighting formula:
$$F(\omega) = S(\omega) \cdot \omega_1 + I(\omega) \cdot \omega_2$$

wherein $F(\omega)$ represents the weighted sum; $S(\omega)$ represents the degree of aggregation of the candidate lemma; $\omega_1$ represents the weight ratio preset for the degree of aggregation; $I(\omega)$ represents the degree of freedom of the candidate lemma; and $\omega_2$ represents the weight ratio preset for the degree of freedom. The specific values of $\omega_1$ and $\omega_2$ can be customized or adjusted by the user according to actual requirements.
In the present disclosure, after the server calculates the weighted sum of all candidate lemmas through the above weighting formula, the calculated weighted sum of all candidate lemmas may be sorted according to the magnitude of the value, and then the server may extract m candidate lemmas with the highest weighted sum among all candidate lemmas as new candidate words or new candidate phrases based on the sorting. The value of m may be set by a user according to actual requirements, for example, the top 10 candidate lemmas with the highest weighted sum value may be extracted.
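Putting the two indexes together, the weighting and extraction steps might look like the following sketch; `w1`, `w2` and `m` correspond to the user-set $\omega_1$, $\omega_2$ and m, and the optional `c_of`/`w3` arguments anticipate the correction parameter introduced below. All names are illustrative.

```python
def extract_new_words(lemmas, s_of, i_of, w1=1.0, w2=1.0,
                      c_of=None, w3=0.0, m=10):
    """Score each candidate lemma with F = S*w1 + I*w2 (plus C*w3 when
    a correction function is supplied), sort by score, and return the
    m highest-scoring lemmas as candidate words or phrases."""
    scored = []
    for lemma in lemmas:
        f = s_of(lemma) * w1 + i_of(lemma) * w2
        if c_of is not None:
            f += c_of(lemma) * w3
        scored.append((f, lemma))
    scored.sort(reverse=True)
    return [lemma for _, lemma in scored[:m]]
```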
In the above embodiment, weighting with the two indexes of the degree of aggregation and the degree of freedom can well measure how inseparable a candidate lemma is and how flexibly it collocates. However, the commodity names and user search logs of an e-commerce website contain many special brand names and proper nouns, so relying only on the degree of aggregation and the degree of freedom may miss words that occur relatively frequently but are easy to split apart or have a low degree of freedom, and such words may well be proper nouns of the e-commerce website.
Therefore, in the present disclosure, in order to make the extraction of candidate words or candidate phrases more accurate, the occurrence probability of the candidate lemma may be introduced into the above weighted calculation as a correction parameter in addition to the degree of aggregation and the degree of freedom; this helps avoid missing, during new word extraction based on the weighting result, words that occur relatively frequently but are easy to split apart or have a low degree of freedom.
In another embodiment shown in the present disclosure, when the server performs weighting calculation on the aggregation degree and the degree of freedom of each candidate lemma, the weighting calculation may be performed by the following weighting formula:
$$F(\omega) = S(\omega) \cdot \omega_1 + I(\omega) \cdot \omega_2 + C(\omega) \cdot \omega_3$$

In the above formula, $F(\omega)$ represents the weighted sum; $S(\omega)$ represents the degree of aggregation of the candidate lemma; $\omega_1$ represents the weight ratio preset for the degree of aggregation; $I(\omega)$ represents the degree of freedom of the candidate lemma; $\omega_2$ represents the weight ratio preset for the degree of freedom; $C(\omega)$ represents a correction parameter; and $\omega_3$ represents the weight ratio preset for the correction parameter.
In the present disclosure, when the occurrence probability of the candidate lemma is used as the correction parameter, $C(\omega)$ in the above formula represents the occurrence probability of the candidate lemma, and $\omega_3$ represents the weight ratio set by the user for it.
It should be noted that the weight ratios of the degree of aggregation, the degree of freedom and the occurrence probability of the candidate lemma in the above formula can all be set and adjusted by the user according to actual needs. In practical applications, the total length of the corpus used by the server may also influence the result of the weighted calculation; for example, when the total corpus length is insufficient, the calculated degree of aggregation and degree of freedom of the candidate lemmas may be inaccurate. In this case, using the occurrence probability of the candidate lemma as the correction parameter and appropriately adjusting its weight ratio can correct the finally calculated weighted sum.
In this disclosure, when adjusting the weight ratio of the occurrence probability of the candidate lemma, the server may compare the total corpus length with a preset threshold. If the total corpus length is greater than the preset threshold, the corpus is sufficient and the degree of aggregation and degree of freedom calculated by the server are generally accurate, so the server may appropriately reduce the weight ratio of the occurrence probability of the candidate lemma based on a preset amplitude. Conversely, if the total corpus length is less than the preset threshold, the corpus may be insufficient and the calculated degree of aggregation and degree of freedom are usually inaccurate, so the server can appropriately raise the weight ratio of the occurrence probability of the candidate lemma based on the preset amplitude, which helps offset the inaccuracy caused by the insufficient corpus and corrects the finally calculated weighted sum to the greatest extent.
The preset amplitude may be set according to how far the total corpus length exceeds or falls below the preset threshold: the larger the gap, the larger the adjustment amplitude that may be set; conversely, the smaller the gap, the smaller the amplitude.
For example, the server may divide the amount by which the total corpus length exceeds (or falls below) the threshold into a plurality of levels and set a corresponding adjustment amplitude for each level, with larger excesses mapped to larger amplitudes. When determining the adjustment amplitude, the actual gap between the total corpus length and the preset threshold can be matched against these levels; once a level is matched, the weight ratio of the occurrence probability of the candidate lemma can be adjusted based on the amplitude corresponding to that level.
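One way to realise this levelled adjustment is sketched below; the threshold, level boundaries and amplitudes are invented for illustration and would be tuned against the actual corpus.

```python
def adjust_correction_weight(w3, total_length, threshold=1_000_000,
                             levels=((500_000, 0.10),
                                     (100_000, 0.05),
                                     (0, 0.02))):
    """Raise w3 (the correction parameter's weight) when the corpus is
    shorter than the threshold, lower it when longer; the further the
    total length is from the threshold, the larger the preset
    amplitude applied."""
    gap = abs(total_length - threshold)
    amplitude = next(a for bound, a in levels if gap >= bound)
    if total_length < threshold:
        return w3 + amplitude
    return max(0.0, w3 - amplitude)
```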
After the server calculates the weighted sum of all candidate lemmas through the weighting formula, the server can still rank the weighted sum of all the candidate lemmas obtained through calculation according to the numerical value, and then the server can extract m candidate lemmas with the highest weighted sum of all the candidate lemmas as new candidate words or new candidate phrases based on the ranking, and the specific process is not repeated.
In the disclosure, after the server extracts new candidate words or candidate phrases from all the candidate lemmas, the extracted candidate words or candidate phrases may be output in a manual review interface, so that a reviewer performs manual review.
In the manual review page, each output candidate word or candidate phrase can be set to a "pass" or "reject" review state respectively. When an output candidate word or candidate phrase is approved by the reviewer, it can be set to the "pass" state. In addition, the manual review page may provide a user option for setting an import directory for the candidate words or candidate phrases whose review is complete, and the reviewer can set the import directory by operating this option. For example, for the search server, candidate words or candidate phrases that have passed review may be imported into the word segmentation vocabulary or the spelling suggestions of the search engine, so the user option in the manual review interface may provide two import directories: word segmentation vocabulary and spelling suggestions.
For the candidate words or candidate phrases whose review is complete, the server can periodically read those set to the "pass" state in the manual review interface, read the import directory set by the reviewer, and import the read candidate words or candidate phrases into the corresponding directory. In this way, when a user later searches with keywords containing an imported candidate word or phrase, the server can correctly segment the keywords through the word segmentation vocabulary and output matched search results; and while the user is entering keywords, the server can also present the imported candidate words or phrases to the user as spelling suggestions, improving the user's search experience.
Therefore, through the above method, the server can continuously acquire corpora and accurately extract new candidate words or candidate phrases from them to enrich the word segmentation vocabulary and spelling suggestions of the search engine, thereby continuously optimizing the user's search experience and improving the accuracy of search results.
In the above embodiments, the technical solution of the present disclosure is described by taking an application scenario of an e-commerce website as an example, and in practical application, the technical solution of the present disclosure may also be applied to other scenarios. When a person skilled in the art puts the above technical solution into practice in other scenarios, reference may be made to the descriptions in the above embodiments of the present disclosure, and details are not described in the present disclosure.
In the above embodiment, the degrees of aggregation of a plurality of candidate lemmas are calculated, the degree of aggregation characterizing the probability that a candidate lemma is a fixed word or fixed phrase; the degrees of freedom of the plurality of candidate lemmas are calculated, the degree of freedom characterizing the flexibility with which a candidate lemma collocates with fixed words or fixed phrases, a higher degree-of-freedom value indicating that more fixed words or fixed phrases can be collocated with the candidate lemma; the calculated degree of aggregation and degree of freedom of each candidate lemma are then weighted to obtain a weighted sum, and candidate words or candidate phrases are extracted from the plurality of candidate lemmas based on the calculated weighted sums.
Corresponding to the embodiment of the new word extraction method, the disclosure also provides an embodiment of a new word extraction device.
Fig. 3 is a schematic block diagram illustrating a new word extraction apparatus according to an exemplary embodiment.
As shown in fig. 3, a new word extraction apparatus 30 according to an exemplary embodiment is shown, including: a first calculation module 301, a second calculation module 302, a third calculation module 303 and an extraction module 304; wherein:
the first calculation module 301 is configured to calculate the aggregation degrees of a plurality of candidate lemmas; the degree of aggregation characterizes the probability of the candidate lemma as a fixed word or a fixed phrase;
the second calculation module 302 configured to calculate degrees of freedom of the plurality of candidate lemmas; the degree of freedom characterizes the flexibility of collocating the candidate lemma with fixed words or fixed phrases; the higher the degree of freedom value, the more fixed words or fixed phrases the candidate lemma can be collocated with;
the third calculating module 303 is configured to perform weighted calculation on the degree of aggregation and the degree of freedom of each candidate lemma calculated by the first calculating module and the second calculating module to obtain a weighted sum;
the extracting module 304 is configured to extract candidate words or candidate phrases from the candidate lemmas based on the weighted sum calculated by the third calculating module.
In the above embodiment, the degrees of aggregation of a plurality of candidate lemmas are calculated, the degree of aggregation characterizing the probability that a candidate lemma is a fixed word or fixed phrase; the degrees of freedom of the plurality of candidate lemmas are calculated, the degree of freedom characterizing the flexibility with which a candidate lemma collocates with fixed words or fixed phrases, a higher degree-of-freedom value indicating that more fixed words or fixed phrases can be collocated with the candidate lemma; the calculated degree of aggregation and degree of freedom of each candidate lemma are then weighted to obtain a weighted sum, and candidate words or candidate phrases are extracted from the plurality of candidate lemmas based on the calculated weighted sums.
Referring to fig. 4, fig. 4 is a block diagram of another new word extracting apparatus according to an exemplary embodiment of the present disclosure, and based on the foregoing embodiment shown in fig. 3, the apparatus 30 may further include an obtaining module 305 and a splitting module 306; wherein:
the obtaining module 305 is configured to obtain corpora;
the splitting module 306 is configured to split the morpheme of the corpus acquired by the acquiring module to obtain the multiple candidate morphemes.
The corpus comprises commodity names and user search logs.
Referring to fig. 5, fig. 5 is a block diagram of another new word extracting apparatus according to an exemplary embodiment of the present disclosure, where on the basis of the foregoing embodiment shown in fig. 4, the first calculating module 301 may include a first statistical submodule 301A, a first selected submodule 301B, and a first calculating submodule 301C; wherein:
the first statistical submodule 301A is configured to count the occurrence frequency of the candidate lemmas in all corpora;
the first selecting sub-module 301B is configured to sequentially select the candidate lemmas as target candidate lemmas;
the first calculating submodule 301C is configured to calculate the degree of aggregation of the target candidate lemma selected by the first selected submodule according to an aggregation calculation formula.
Wherein, the calculation formula of the degree of aggregation is as follows:

$$S(A_1 A_2 \ldots A_n) = \min\left\{ \frac{P(A_1 A_2 \ldots A_n)}{P(A_1) \cdot P(A_2 \ldots A_n)},\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 A_2) \cdot P(A_3 \ldots A_n)},\; \ldots,\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 \ldots A_{n-1}) \cdot P(A_n)} \right\}$$

or,

$$S(A_1 A_2 \ldots A_n) = \max\left\{ \frac{P(A_1 A_2 \ldots A_n)}{P(A_1) \cdot P(A_2 \ldots A_n)},\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 A_2) \cdot P(A_3 \ldots A_n)},\; \ldots,\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 \ldots A_{n-1}) \cdot P(A_n)} \right\}$$

$S(A_1 A_2 \ldots A_n)$ represents the degree of aggregation of the candidate lemma $[A_1 A_2 \ldots A_n]$; $P(A_1 A_2 \ldots A_n)$ represents the probability that the candidate lemma $[A_1 A_2 \ldots A_n]$ occurs in all corpora; $P(A_1)$ and $P(A_2 \ldots A_n)$ respectively represent the probabilities that the character $[A_1]$ and the character string $[A_2 \ldots A_n]$ split from the candidate lemma occur in all corpora; the occurrence probability is the ratio of the number of occurrences in all corpora to the total corpus length; the value of $n$ represents the number of characters or letters constituting the candidate lemma.
It should be noted that the structures of the first statistical submodule 301A, the first selected submodule 301B and the first calculating submodule 301C shown in the apparatus embodiment of fig. 5 may also be included in the aforementioned apparatus embodiment of fig. 3, and the present disclosure is not limited thereto.
Referring to fig. 6, fig. 6 is a block diagram of another new word extracting apparatus according to an exemplary embodiment of the present disclosure, which is based on the foregoing embodiment shown in fig. 5, in which the second calculating module 302 may include a second statistics sub-module 302A, a second selected sub-module 302B, a second calculating sub-module 302C, a third calculating sub-module 302D, and an adding sub-module 302E; wherein:
the second statistical submodule 302A is configured to record adjacent words of the plurality of candidate lemmas, and count the occurrence number of the adjacent words of the plurality of candidate lemmas in all corpora;
the second selecting sub-module 302B is configured to sequentially select the candidate lemmas as target candidate lemmas;
the second calculating submodule 302C is configured to calculate the occurrence probability of the adjacent word of the target candidate lemma selected by the second selected submodule in all the corpora;
the third computation submodule 302D is configured to compute an information entropy of the adjacent word based on the occurrence probability computed by the second computation submodule;
the adding sub-module 302E is configured to add the information entropies of all the adjacent words of the target candidate lemma calculated by the third calculation module to obtain the degree of freedom of the target candidate lemma.
It should be noted that, the second calculation module 302 shown in the apparatus embodiment shown in fig. 6 may include structures of a second statistics sub-module 302A, a second selection sub-module 302B, a second calculation sub-module 302C, a third calculation sub-module 302D, and an addition sub-module 302E, which may also be included in the apparatus embodiments of fig. 3 to 4, and the disclosure is not limited thereto.
Referring to fig. 7, fig. 7 is a block diagram of another new word extracting apparatus according to an exemplary embodiment of the present disclosure, where on the basis of the foregoing embodiment shown in fig. 6, the third calculating module 303 may include a weighting sub-module 303A; wherein:
the weighting submodule 303A is configured to perform weighting calculation on the degree of aggregation and the degree of freedom of each candidate lemma calculated by the first calculation module and the second calculation module according to a preset weighting formula to obtain a weighted sum;
the preset weighting formula is as follows:
$$F(\omega) = S(\omega) \cdot \omega_1 + I(\omega) \cdot \omega_2$$

or,

$$F(\omega) = S(\omega) \cdot \omega_1 + I(\omega) \cdot \omega_2 + C(\omega) \cdot \omega_3$$

wherein $F(\omega)$ represents the weighted sum; $S(\omega)$ represents the degree of aggregation of the candidate lemma; $\omega_1$ represents the weight ratio preset for the degree of aggregation; $I(\omega)$ represents the degree of freedom of the candidate lemma; $\omega_2$ represents the weight ratio preset for the degree of freedom; $C(\omega)$ represents a correction parameter; $\omega_3$ represents the weight ratio preset for the correction parameter.
It should be noted that the structure of the weighting sub-module 303A shown in the apparatus embodiment shown in fig. 7 may also be included in the apparatus embodiments of fig. 3 to 5, and the disclosure is not limited thereto.
Referring to fig. 8, fig. 8 is a block diagram of another new word extraction apparatus according to an exemplary embodiment of the present disclosure, where the embodiment is based on the foregoing embodiment shown in fig. 7, and the modification parameter is the probability of occurrence of the candidate lemma in all corpora; the third calculation module 303 may further include a judgment sub-module 303B, an increase sub-module 303C, and a decrease sub-module 303D; wherein:
the judgment submodule 303B is configured to judge whether the total length of the corpus is greater than a preset threshold;
the increasing submodule 303C is configured to increase, based on a preset amplitude, a weight ratio of occurrence probabilities of the candidate lemmas in all the corpuses when the total length of the corpuses is lower than a preset threshold;
the reducing submodule 303D is configured to reduce, when the total length of the corpuses is greater than a preset threshold, a weight ratio of occurrence probabilities of the candidate lemmas in all the corpuses based on a preset amplitude.
It should be noted that, the structures of the judgment sub-module 303B, the increase sub-module 303C and the decrease sub-module 303D shown in the apparatus embodiment shown in fig. 8 may also be included in the apparatus embodiments of fig. 3 to 7, and the disclosure is not limited thereto.
Referring to fig. 9, fig. 9 is a block diagram of another new word extracting apparatus according to an exemplary embodiment of the present disclosure, which is based on the foregoing embodiment shown in fig. 3, in which the extracting module 304 may include a sorting sub-module 304A and an extracting sub-module 304B; wherein:
the sorting submodule 304A is configured to sort the weighted sum calculated by the third calculation module according to the numerical value;
the extracting sub-module 304B is configured to extract m candidate lemmas with the highest weighted sum among the candidate lemmas as candidate words or candidate phrases based on the ranking; wherein the value of m is set by a user.
It should be noted that the structures of the sorting sub-module 304A and the extracting sub-module 304B shown in the apparatus embodiment shown in fig. 9 may also be included in the apparatus embodiments of fig. 4 to 8, and the disclosure is not limited thereto.
Referring to fig. 10, fig. 10 is a block diagram of another new word extracting apparatus shown in the present disclosure according to an exemplary embodiment, in which on the basis of the foregoing embodiment shown in fig. 9, the extracting module 304 may further include an output sub-module 304C and an import sub-module 304D; wherein:
the output sub-module 304C is configured to output the candidate words or candidate phrases extracted by the extraction sub-module on a manual review interface;
the import sub-module 304D is configured to audit the passed candidate words or candidate word groups in the manual audit interface and import spelling suggestions or word segmentation word lists of a search engine.
It should be noted that the structures of the output sub-module 304C and the import sub-module 304D shown in the apparatus embodiment shown in fig. 10 may also be included in the apparatus embodiments of fig. 3 to 8, and the disclosure is not limited thereto.
The implementation process of the functions and actions of each module in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, wherein the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the disclosed solution. One of ordinary skill in the art can understand and implement it without inventive effort.
Correspondingly, the present disclosure further provides a new word extraction device, the device including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
calculating the degree of aggregation of a plurality of candidate lemmas; the degree of aggregation characterizes the probability of the candidate lemma as a fixed word or a fixed phrase;
calculating degrees of freedom of the plurality of candidate lemmas; the degree of freedom characterizes the flexibility of collocating the candidate lemma with fixed words or fixed phrases; the higher the degree of freedom value, the more fixed words or fixed phrases the candidate lemma can be collocated with;
respectively carrying out weighted calculation on the calculated aggregation degree and the calculated freedom degree of each candidate lemma in the plurality of candidate lemmas to obtain weighted sums;
and extracting candidate words or candidate phrases from the candidate lemmas based on the calculated weighted sum.
Accordingly, the present disclosure also provides a server comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs configured to be executed by the one or more processors include instructions for:
calculating the degree of aggregation of a plurality of candidate lemmas; the degree of aggregation characterizes the probability of the candidate lemma as a fixed word or a fixed phrase;
calculating degrees of freedom of the plurality of candidate lemmas; the degree of freedom characterizes the flexibility of collocating the candidate lemma with fixed words or fixed phrases; the higher the degree of freedom value, the more fixed words or fixed phrases the candidate lemma can be collocated with;
respectively carrying out weighted calculation on the calculated aggregation degree and the calculated freedom degree of each candidate lemma in the plurality of candidate lemmas to obtain weighted sums;
and extracting candidate words or candidate phrases from the candidate lemmas based on the calculated weighted sum.
Fig. 11 is a block diagram illustrating an apparatus 1100 for new word extraction according to an example embodiment. For example, the apparatus 1100 may be provided as a server.
Referring to fig. 11, the apparatus 1100 includes a processing component 1122 that further includes one or more processors, and memory resources represented by memory 1132 for storing instructions, such as application programs, executable by the processing component 1122. The application programs stored in memory 1132 may include one or more modules each corresponding to a set of instructions. Further, the processing component 1122 is configured to execute instructions to perform the new word extraction method described above.
The apparatus 1100 may also include a power component 1126 configured to perform power management of the apparatus 1100, a wired or wireless network interface 1150 configured to connect the apparatus 1100 to a network, and an input/output (I/O) interface 1158. The apparatus 1100 may operate based on an operating system stored in memory 1132, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (19)

1. A method for extracting new words, the method comprising:
calculating the degree of aggregation of a plurality of candidate lemmas; the degree of aggregation characterizes the probability of the candidate lemma as a fixed word or a fixed phrase;
calculating degrees of freedom of the plurality of candidate lemmas; the degree of freedom characterizes the flexibility of collocating the candidate lemma with fixed words or fixed phrases; the higher the degree of freedom value, the more fixed words or fixed phrases the candidate lemma can be collocated with;
respectively carrying out weighted calculation on the calculated aggregation degree and the calculated freedom degree of each candidate lemma in the plurality of candidate lemmas to obtain weighted sums;
and extracting candidate words or candidate phrases from the candidate lemmas based on the calculated weighted sum.
2. The method of claim 1, wherein the method further comprises:
obtaining a corpus;
and carrying out lexical element splitting on the obtained corpus to obtain the candidate lexical elements.
3. The method of claim 2, wherein the corpus comprises names of goods and user search logs.
4. The method of claim 2, wherein said calculating the degree of aggregation of the plurality of candidate lemmas comprises:
counting the occurrence times of the candidate lemmas in all the linguistic data;
sequentially selecting the candidate word elements as target candidate word elements;
and calculating the degree of cohesion of the selected target candidate lemmas according to a degree of cohesion calculation formula.
Wherein, the calculation formula of the degree of aggregation is as follows:

$$S(A_1 A_2 \ldots A_n) = \min\left\{ \frac{P(A_1 A_2 \ldots A_n)}{P(A_1) \cdot P(A_2 \ldots A_n)},\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 A_2) \cdot P(A_3 \ldots A_n)},\; \ldots,\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 \ldots A_{n-1}) \cdot P(A_n)} \right\}$$

or,

$$S(A_1 A_2 \ldots A_n) = \max\left\{ \frac{P(A_1 A_2 \ldots A_n)}{P(A_1) \cdot P(A_2 \ldots A_n)},\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 A_2) \cdot P(A_3 \ldots A_n)},\; \ldots,\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 \ldots A_{n-1}) \cdot P(A_n)} \right\}$$

$S(A_1 A_2 \ldots A_n)$ represents the degree of aggregation of the candidate lemma $[A_1 A_2 \ldots A_n]$; $P(A_1 A_2 \ldots A_n)$ represents the probability that the candidate lemma $[A_1 A_2 \ldots A_n]$ occurs in all corpora; $P(A_1)$ and $P(A_2 \ldots A_n)$ respectively represent the probabilities that the character $[A_1]$ and the character string $[A_2 \ldots A_n]$ split from the candidate lemma occur in all corpora; the occurrence probability is the ratio of the number of occurrences in all corpora to the total corpus length; the value of $n$ represents the number of characters or letters constituting the candidate lemma.
5. The method of claim 4, wherein said computing degrees of freedom for a plurality of candidate tokens comprises:
recording adjacent characters of the candidate lemmas, and counting the occurrence times of the adjacent characters of the candidate lemmas in all linguistic data;
sequentially selecting the candidate word elements as target candidate word elements;
calculating the occurrence probability of the adjacent characters of the selected target candidate lemmas in all linguistic data;
calculating information entropy of the adjacent word based on the calculated occurrence probability;
and adding the calculated information entropies of all the adjacent words of the target candidate lemma to obtain the degree of freedom of the target candidate lemma.
6. The method of claim 5, wherein the weighted computing of the calculated degree of aggregation and the degree of freedom for each of the plurality of candidate tokens, respectively, to obtain a weighted sum comprises:
respectively carrying out weighted calculation on the aggregation degree and the freedom degree of each candidate lemma in the candidate lemma according to a preset weighting formula to obtain a weighted sum;
the preset weighting formula is as follows:
$$F(\omega) = S(\omega) \cdot \omega_1 + I(\omega) \cdot \omega_2$$

or,

$$F(\omega) = S(\omega) \cdot \omega_1 + I(\omega) \cdot \omega_2 + C(\omega) \cdot \omega_3$$

wherein $F(\omega)$ represents the weighted sum; $S(\omega)$ represents the degree of aggregation of the candidate lemma; $\omega_1$ represents the weight ratio preset for the degree of aggregation; $I(\omega)$ represents the degree of freedom of the candidate lemma; $\omega_2$ represents the weight ratio preset for the degree of freedom; $C(\omega)$ represents a correction parameter; $\omega_3$ represents the weight ratio preset for the correction parameter.
7. The method of claim 6, wherein the modification parameter is a probability of occurrence of the candidate lemma in all corpora;
the method further comprises the following steps:
judging whether the total length of the corpus is greater than a preset threshold value or not;
when the total length of the linguistic data is lower than a preset threshold value, the weight proportion of the occurrence probability of the candidate lemmas in all the linguistic data is improved based on a preset amplitude;
and when the total length of the linguistic data is greater than a preset threshold value, reducing the weight proportion of the occurrence probability of the candidate lemmas in all the linguistic data based on a preset amplitude.
8. The method of claim 6, wherein said extracting candidate words or candidate phrases from said plurality of candidate tokens based on said calculated weighted sum comprises:
sorting the weighted sums obtained by calculation according to the numerical value;
extracting m candidate lemmas with the highest weighted sum in the candidate lemmas as candidate words or candidate phrases based on the ordering;
wherein the value of m is set by a user.
9. The method of claim 8, wherein the method further comprises:
outputting the extracted candidate words or candidate phrases on a manual review interface;
and importing the candidate words or candidate phrases which are checked in the manual checking interface into spelling suggestions or word segmentation word lists of a search engine.
10. A new word extraction apparatus, characterized in that the apparatus comprises:
a first calculation module configured to calculate a degree of aggregation of a plurality of candidate lemmas; the degree of aggregation characterizes the probability of the candidate lemma as a fixed word or a fixed phrase;
a second calculation module configured to calculate degrees of freedom of the plurality of candidate lemmas; the degree of freedom characterizes the flexibility of collocating the candidate lemma with fixed words or fixed phrases; the higher the degree of freedom value, the more fixed words or fixed phrases the candidate lemma can be collocated with;
a third calculation module configured to perform weighted calculation on the degree of aggregation and the degree of freedom of each candidate lemma calculated by the first calculation module and the second calculation module to obtain a weighted sum;
an extracting module configured to extract candidate words or candidate phrases from the plurality of candidate lemmas based on the weighted sum calculated by the third calculating module.
11. The apparatus of claim 10, wherein the apparatus further comprises:
an acquisition module configured to acquire a corpus;
and the splitting module is configured to split the corpus acquired by the acquiring module into lemmas to obtain the plurality of candidate lemmas.
12. The apparatus of claim 11, wherein the corpus comprises names of goods and user search logs.
13. The apparatus of claim 11, wherein the first computing module comprises:
the first statistic submodule is configured to count the occurrence times of the candidate lemmas in all the corpora;
a first selection submodule configured to sequentially select the plurality of candidate lemmas as target candidate lemmas;
a first calculation submodule configured to calculate, according to the degree of aggregation calculation formula, the degree of aggregation of the target candidate lemma selected by the first selection submodule.
wherein the calculation formula of the degree of aggregation is as follows:
S(A₁A₂…Aₙ) = min{ P(A₁A₂…Aₙ) / (P(A₁)·P(A₂…Aₙ)), P(A₁A₂…Aₙ) / (P(A₁A₂)·P(A₃…Aₙ)), …, P(A₁A₂…Aₙ) / (P(A₁…Aₙ₋₁)·P(Aₙ)) }
wherein S(A₁A₂…Aₙ) represents the degree of aggregation of the candidate lemma [A₁A₂…Aₙ]; P(A₁A₂…Aₙ) represents the occurrence probability of the candidate lemma [A₁A₂…Aₙ] in all the corpora; P(A₁) and P(A₂…Aₙ) respectively represent the occurrence probabilities, in all the corpora, of the character [A₁] and the character string [A₂…Aₙ] split from the candidate lemma [A₁A₂…Aₙ]; the occurrence probability is the ratio of the number of occurrences in all the corpora to the total length of the corpora; and the value of n represents the number of characters or letters constituting the candidate lemma.
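A sketch of the degree-of-aggregation formula: the minimum, over every binary split of the lemma, of the ratio between the lemma's occurrence probability and the product of the split parts' probabilities. The prob callable and the handling of unseen substrings are assumptions of the sketch.

```python
def degree_of_aggregation(lemma, prob):
    # prob(s) should return the occurrence probability of string s,
    # i.e. its occurrence count divided by the total corpus length.
    p_whole = prob(lemma)
    ratios = []
    for k in range(1, len(lemma)):
        prefix, suffix = lemma[:k], lemma[k:]
        denom = prob(prefix) * prob(suffix)
        # Unseen substring: treat the split as maximally cohesive
        # (an assumed convention; the claim does not address it).
        ratios.append(p_whole / denom if denom > 0 else float("inf"))
    return min(ratios) if ratios else 0.0
```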
14. The apparatus of claim 13, wherein the second computing module comprises:
the second statistic submodule is configured to record the characters adjacent to the candidate lemmas and count the occurrence times of those adjacent characters in all the corpora;
a second selection submodule configured to sequentially select the plurality of candidate lemmas as target candidate lemmas;
a second calculating submodule configured to calculate the occurrence probabilities, in all the corpora, of the characters adjacent to the target candidate lemma selected by the second selection submodule;
a third calculation submodule configured to calculate the information entropy of the adjacent characters based on the occurrence probability calculated by the second calculating submodule;
an adding submodule configured to add the information entropies of all the characters adjacent to the target candidate lemma calculated by the third calculation submodule to obtain the degree of freedom of the target candidate lemma.
15. The apparatus of claim 14, wherein the third computing module comprises:
the weighting submodule is configured to perform, according to a preset weighting formula, a weighted calculation on the degree of aggregation and the degree of freedom of each candidate lemma among the plurality of candidate lemmas calculated by the first calculation module and the second calculation module to obtain the weighted sum;
the preset weighting formula is as follows:
F(ω) = S(ω)·ω₁ + I(ω)·ω₂
or,
F(ω) = S(ω)·ω₁ + I(ω)·ω₂ + C(ω)·ω₃
wherein F(ω) represents the weighted sum; S(ω) represents the degree of aggregation of the candidate lemma; ω₁ represents the weight proportion preset for the degree of aggregation; I(ω) represents the degree of freedom of the candidate lemma; ω₂ represents the weight proportion preset for the degree of freedom; C(ω) represents a correction parameter; and ω₃ represents the weight proportion preset for the correction parameter.
16. The apparatus of claim 15, wherein the correction parameter is the occurrence probability of the candidate lemma in all the corpora;
the third computing module further comprises:
the judging submodule is configured to judge whether the total length of the corpora is greater than a preset threshold;
the increasing submodule is configured to increase, by a preset amplitude, the weight proportion of the occurrence probability of the candidate lemma in all the corpora when the total length of the corpora is lower than the preset threshold;
and the reducing submodule is configured to reduce, by a preset amplitude, the weight proportion of the occurrence probability of the candidate lemma in all the corpora when the total length of the corpora is greater than the preset threshold.
17. The apparatus of claim 15, wherein the extraction module comprises:
the sorting submodule is configured to sort, by numerical value, the weighted sums calculated by the third calculation module;
an extraction submodule configured to extract, based on the sorting, the m candidate lemmas with the highest weighted sums among the plurality of candidate lemmas as candidate words or candidate phrases; wherein the value of m is set by a user.
18. The apparatus of claim 17, wherein the extraction module further comprises:
the output submodule is configured to output the candidate words or candidate phrases extracted by the extraction submodule on a manual review interface;
and the import submodule is configured to import the candidate words or candidate phrases approved in the manual review interface into a spelling suggestion list or a word segmentation word list of a search engine.
19. A new word extraction device, characterized by comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
calculating a degree of aggregation of a plurality of candidate lemmas; the degree of aggregation characterizes the probability of the candidate lemma being a fixed word or a fixed phrase;
calculating degrees of freedom of the plurality of candidate lemmas; the degree of freedom characterizes the flexibility of matching the candidate lemma with fixed words or fixed phrases; a higher degree of freedom value indicates that the candidate lemma can be matched with more fixed words or fixed phrases;
performing a weighted calculation, respectively, on the calculated degree of aggregation and degree of freedom of each candidate lemma among the plurality of candidate lemmas to obtain weighted sums;
and extracting candidate words or candidate phrases from the plurality of candidate lemmas based on the calculated weighted sums.
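Tying the pieces together, a minimal end-to-end sketch of the claimed pipeline, reusing the helper functions sketched above. The substring-based candidate generation, the weights, and the value of m are illustrative assumptions, not details from the patent.

```python
from collections import Counter

def candidate_lemmas(corpus, max_len=4):
    # One simple reading of "splitting the corpus into lemmas":
    # every substring of 2..max_len characters is a candidate.
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(corpus) - n + 1):
            counts[corpus[i:i + n]] += 1
    return counts

def extract_new_words(corpus, m=20, w1=1.0, w2=1.0):
    # Occurrence probability = occurrence count / total corpus length,
    # matching the definition given with the aggregation formula.
    total = len(corpus)
    prob = lambda s: corpus.count(s) / total

    scores = {}
    for lemma in candidate_lemmas(corpus):
        s = degree_of_aggregation(lemma, prob)
        i = degree_of_freedom(lemma, corpus)
        scores[lemma] = weighted_sum(s, i, w1, w2)
    return extract_top_m(scores, m)
```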
CN201510729084.7A 2015-10-30 2015-10-30 New words extraction method and apparatus Active CN105260362B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510729084.7A CN105260362B (en) 2015-10-30 2015-10-30 New words extraction method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510729084.7A CN105260362B (en) 2015-10-30 2015-10-30 New words extraction method and apparatus

Publications (2)

Publication Number Publication Date
CN105260362A 2016-01-20
CN105260362B CN105260362B (en) 2019-02-12

Family

ID=55100058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510729084.7A Active CN105260362B (en) 2015-10-30 2015-10-30 New words extraction method and apparatus

Country Status (1)

Country Link
CN (1) CN105260362B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102930055A (en) * 2012-11-18 2013-02-13 浙江大学 New network word discovery method in combination with internal polymerization degree and external discrete information entropy
CN103020022A (en) * 2012-11-20 2013-04-03 北京航空航天大学 Chinese unregistered word recognition system and method based on improvement information entropy characteristics
CN103049501A (en) * 2012-12-11 2013-04-17 上海大学 Chinese domain term recognition method based on mutual information and conditional random field model
CN103092966A (en) * 2013-01-23 2013-05-08 盘古文化传播有限公司 Vocabulary mining method and device

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017185674A1 (en) * 2016-04-29 2017-11-02 乐视控股(北京)有限公司 Method and apparatus for discovering new word
CN105955950A (en) * 2016-04-29 2016-09-21 乐视控股(北京)有限公司 New word discovery method and device
CN106126495A (en) * 2016-06-16 2016-11-16 北京捷通华声科技股份有限公司 A kind of based on large-scale corpus prompter method and apparatus
CN106126495B (en) * 2016-06-16 2019-03-12 北京捷通华声科技股份有限公司 One kind being based on large-scale corpus prompter method and apparatus
CN110245343A (en) * 2018-03-07 2019-09-17 优酷网络技术(北京)有限公司 Barrage analysis method and device
CN109299230A (en) * 2018-09-06 2019-02-01 华泰证券股份有限公司 A kind of customer service public sentiment hot word data digging system and method
CN109492224A (en) * 2018-11-07 2019-03-19 北京金山数字娱乐科技有限公司 A kind of method and device of vocabulary building
CN109492224B (en) * 2018-11-07 2024-05-03 北京金山数字娱乐科技有限公司 Vocabulary construction method and device
CN109858010A (en) * 2018-11-26 2019-06-07 平安科技(深圳)有限公司 Field new word identification method, device, computer equipment and storage medium
CN109858023A (en) * 2019-01-04 2019-06-07 北京车慧科技有限公司 A kind of sentence error correction device
CN110110322A (en) * 2019-03-29 2019-08-09 泰康保险集团股份有限公司 Network new word discovery method, apparatus, electronic equipment and storage medium
CN112182448A (en) * 2019-07-05 2021-01-05 百度在线网络技术(北京)有限公司 Page information processing method, device and equipment
CN111626054A (en) * 2020-05-21 2020-09-04 北京明亿科技有限公司 New illegal behavior descriptor identification method and device, electronic equipment and storage medium
CN111626054B (en) * 2020-05-21 2023-12-19 北京明亿科技有限公司 Novel illegal action descriptor recognition method and device, electronic equipment and storage medium
CN111898010A (en) * 2020-07-10 2020-11-06 时趣互动(北京)科技有限公司 New keyword mining method and device and electronic equipment
CN111898010B (en) * 2020-07-10 2024-09-13 时趣互动(北京)科技有限公司 New keyword mining method and device and electronic equipment
CN111831832B (en) * 2020-07-27 2022-07-01 北京世纪好未来教育科技有限公司 Word list construction method, electronic device and computer readable medium
CN111831832A (en) * 2020-07-27 2020-10-27 北京世纪好未来教育科技有限公司 Word list construction method, electronic device and computer readable medium
CN112560448A (en) * 2021-02-20 2021-03-26 京华信息科技股份有限公司 New word extraction method and device
CN112560448B (en) * 2021-02-20 2021-06-22 京华信息科技股份有限公司 New word extraction method and device
CN113449082A (en) * 2021-07-16 2021-09-28 上海明略人工智能(集团)有限公司 New word discovery method, system, electronic device and medium
CN115034211A (en) * 2022-05-19 2022-09-09 一点灵犀信息技术(广州)有限公司 Unknown word discovery method and device, electronic equipment and storage medium
CN115146632A (en) * 2022-06-23 2022-10-04 淮阴工学院 New word recognition-based chemical field word segmentation method

Also Published As

Publication number Publication date
CN105260362B (en) 2019-02-12

Similar Documents

Publication Publication Date Title
CN105260362B (en) New words extraction method and apparatus
CN106649818B (en) Application search intention identification method and device, application search method and server
US20210056571A1 (en) Determining of summary of user-generated content and recommendation of user-generated content
JP3882048B2 (en) Question answering system and question answering processing method
CN109597986A (en) Localization method, device, equipment and the storage medium of abnormal problem
US20180052823A1 (en) Hybrid Classifier for Assigning Natural Language Processing (NLP) Inputs to Domains in Real-Time
CN107885717B (en) Keyword extraction method and device
CN102279890A (en) Sentiment word extracting and collecting method based on micro blog
CN104484380A (en) Personalized search method and personalized search device
CN103761254A (en) Method for matching and recommending service themes in various fields
US10902197B1 (en) Vocabulary determination and vocabulary-based content recommendations
KR101541306B1 (en) Computer enabled method of important keyword extraction, server performing the same and storage media storing the same
CN105653547A (en) Method and device for extracting keywords of text
CN114818729A (en) Method, device and medium for training semantic recognition model and searching sentence
Rathan et al. Every post matters: a survey on applications of sentiment analysis in social media
Gambini et al. Tweets2Stance: users stance detection exploiting zero-shot learning algorithms on tweets
CN111160699A (en) Expert recommendation method and system
CN115495636A (en) Webpage searching method, device and storage medium
CN112541069A (en) Text matching method, system, terminal and storage medium combined with keywords
US20140278375A1 (en) Methods and system for calculating affect scores in one or more documents
CN109783175B (en) Application icon management method and device, readable storage medium and terminal equipment
CN115169368B (en) Machine reading understanding method and device based on multiple documents
CN108711073B (en) User analysis method, device and terminal
CN114416977A (en) Text difficulty grading evaluation method and device, equipment and storage medium
CN115130455A (en) Article processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant