CN105260362A - New word extraction method and device


Publication number
CN105260362A
Authority
CN
China
Prior art keywords
candidate
degree
lemma
lemmas
words
Legal status
Granted
Application number
CN201510729084.7A
Other languages
Chinese (zh)
Other versions
CN105260362B (en)
Inventor
赵旭海
孟超
王海洲
张寅
Current Assignee
Xiaomi Inc
Original Assignee
Xiaomi Inc
Application filed by Xiaomi Inc
Priority to CN201510729084.7A
Publication of CN105260362A
Application granted
Publication of CN105260362B
Status: Active


Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a new word extraction method and device. The method comprises the following steps: calculating the cohesion degree of multiple candidate lexical units, where the cohesion degree represents the probability that a candidate lexical unit is a fixed word or set phrase; calculating the degree of freedom of the multiple candidate lexical units, where the degree of freedom represents how flexibly a candidate lexical unit can be matched with fixed words or set phrases, a higher degree of freedom value indicating that the candidate lexical unit can be matched with more fixed words or set phrases; performing a weighted calculation on the calculated cohesion degree and degree of freedom of each candidate lexical unit to obtain a weighted sum; and extracting candidate words or candidate phrases from the candidate lexical units based on the calculated weighted sums. According to the invention, new candidate words or candidate phrases can be extracted from the candidate lexical units more intelligently, and the extraction accuracy of candidate words or candidate phrases can be remarkably improved.

Description

New word extraction method and device
Technical Field
The disclosure relates to the field of terminals, in particular to a new word extraction method and device.
Background
As the variety of goods sold on e-commerce websites grows and users' everyday word and sentence searches evolve, many unique brand names and popular word collocations have appeared. These word collocations are usually not stored in the original word segmentation word list of a search engine, so the search engine may fail to split certain word collocations searched by a user accurately, and the search results may not meet the user's expectations.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a new word extraction method and apparatus.
According to a first aspect of the embodiments of the present disclosure, there is provided a new word extraction method, including:
calculating the degree of aggregation of a plurality of candidate lemmas; the degree of aggregation characterizes the probability of the candidate lemma as a fixed word or a fixed phrase;
calculating degrees of freedom of the candidate lemmas; the degree of freedom characterizes the flexibility with which the candidate lemmas are collocated with fixed words or fixed phrases; a higher degree of freedom value indicates that the candidate lemma can be collocated with more fixed words or fixed phrases;
respectively carrying out weighted calculation on the calculated aggregation degree and the calculated freedom degree of each candidate lemma in the plurality of candidate lemmas to obtain weighted sums;
and extracting candidate words or candidate phrases from the candidate lemmas based on the calculated weighted sum.
Optionally, the method further includes:
obtaining a corpus;
and carrying out lexical element splitting on the obtained corpus to obtain the candidate lexical elements.
Optionally, the corpus includes a commodity name and a user search log.
Optionally, the calculating the aggregation degrees of the multiple candidate lemmas includes:
counting the occurrence times of the candidate lemmas in all the linguistic data;
sequentially selecting the candidate word elements as target candidate word elements;
and calculating the degree of cohesion of the selected target candidate lemmas according to a degree of cohesion calculation formula.
Wherein, the calculation formula of the degree of aggregation is as follows:
S(A1A2…An) = min{ P(A1A2…An) / (P(A1) * P(A2…An)), P(A1A2…An) / (P(A1A2) * P(A3…An)), …, P(A1A2…An) / (P(A1…An-1) * P(An)) }
or,
S(A1A2…An) = max{ P(A1A2…An) / (P(A1) * P(A2…An)), P(A1A2…An) / (P(A1A2) * P(A3…An)), …, P(A1A2…An) / (P(A1…An-1) * P(An)) }
where S(A1A2…An) represents the degree of aggregation of the candidate lemma [A1A2…An]; P(A1A2…An) represents the probability of occurrence of the candidate lemma [A1A2…An] in all corpora; P(A1) and P(A2…An) respectively represent the probabilities of occurrence, in all corpora, of the character [A1] and the character string [A2…An] split from the candidate lemma [A1A2…An]; the occurrence probability is the ratio of the number of occurrences in all corpora to the total length of the corpora; and the value of n represents the number of characters or letters constituting the candidate lemma.
Optionally, the calculating the degrees of freedom of the plurality of candidate lemmas includes:
recording adjacent characters of the candidate lemmas, and counting the occurrence times of the adjacent characters of the candidate lemmas in all linguistic data;
sequentially selecting the candidate word elements as target candidate word elements;
calculating the occurrence probability of the adjacent characters of the selected target candidate lemmas in all linguistic data;
calculating information entropy of the adjacent word based on the calculated occurrence probability;
and adding the calculated information entropies of all the adjacent words of the target candidate lemma to obtain the degree of freedom of the target candidate lemma.
Optionally, the weighting calculation of the aggregation degree and the degree of freedom of each of the plurality of candidate lemmas to obtain a weighted sum includes:
respectively carrying out weighted calculation on the aggregation degree and the freedom degree of each candidate lemma in the plurality of candidate lemmas according to a preset weighting formula to obtain a weighted sum;
the preset weighting formula is as follows:
F(ω) = S(ω)*ω1 + I(ω)*ω2
or,
F(ω) = S(ω)*ω1 + I(ω)*ω2 + C(ω)*ω3
where F(ω) represents the weighted sum; S(ω) represents the degree of aggregation of the candidate lemma; ω1 represents the weight ratio preset for the degree of aggregation; I(ω) represents the degree of freedom of the candidate lemma; ω2 represents the weight ratio preset for the degree of freedom; C(ω) represents a correction parameter; and ω3 represents the weight ratio preset for the correction parameter.
Optionally, the correction parameter is the occurrence probability of the candidate lemma in all the corpora;
the method further comprises the following steps:
judging whether the total length of the corpora is greater than a preset threshold value;
when the total length of the corpora is lower than the preset threshold value, increasing, by a preset amplitude, the weight proportion of the occurrence probability of the candidate lemmas in all the corpora;
and when the total length of the corpora is greater than the preset threshold value, reducing, by a preset amplitude, the weight proportion of the occurrence probability of the candidate lemmas in all the corpora.
Optionally, the extracting candidate words or candidate phrases from the candidate lemmas based on the weighted sum obtained by the calculation includes:
sorting the weighted sums obtained by calculation according to the numerical value;
extracting m candidate lemmas with the highest weighted sum in the candidate lemmas as candidate words or candidate phrases based on the ordering;
wherein the value of m is set by a user.
Optionally, the method further includes:
outputting the extracted candidate words or candidate phrases on a manual review interface;
and importing the candidate words or candidate phrases that have passed review in the manual review interface into the spelling suggestions or word segmentation word list of a search engine.
According to a second aspect of the embodiments of the present disclosure, there is provided a new word extraction apparatus, the apparatus including:
a first calculation module configured to calculate a degree of aggregation of a plurality of candidate lemmas; the degree of aggregation characterizes the probability of the candidate lemma as a fixed word or a fixed phrase;
a second calculation module configured to calculate degrees of freedom of the plurality of candidate lemmas; the degree of freedom characterizes the flexibility with which the candidate lemmas are collocated with fixed words or fixed phrases; a higher degree of freedom value indicates that the candidate lemma can be collocated with more fixed words or fixed phrases;
a third calculation module configured to perform weighted calculation on the degree of aggregation and the degree of freedom of each candidate lemma calculated by the first calculation module and the second calculation module to obtain a weighted sum;
an extracting module configured to extract candidate words or candidate phrases from the plurality of candidate lemmas based on the weighted sum calculated by the third calculating module.
Optionally, the apparatus further comprises:
an acquisition module configured to acquire a corpus;
and the splitting module is configured to split the morphemes of the corpus acquired by the acquiring module to obtain the candidate morphemes.
Optionally, the corpus includes a commodity name and a user search log.
Optionally, the first computing module includes:
the first statistic submodule is configured to count the occurrence times of the candidate lemmas in all the linguistic data;
a first selection submodule configured to sequentially select the plurality of candidate lemmas as target candidate lemmas;
a first calculation submodule configured to calculate the degree of aggregation of the target candidate lemma selected by the first selection submodule according to an aggregation degree calculation formula.
Wherein, the calculation formula of the degree of aggregation is as follows:
S(A1A2…An) = min{ P(A1A2…An) / (P(A1) * P(A2…An)), P(A1A2…An) / (P(A1A2) * P(A3…An)), …, P(A1A2…An) / (P(A1…An-1) * P(An)) }
or,
S(A1A2…An) = max{ P(A1A2…An) / (P(A1) * P(A2…An)), P(A1A2…An) / (P(A1A2) * P(A3…An)), …, P(A1A2…An) / (P(A1…An-1) * P(An)) }
where S(A1A2…An) represents the degree of aggregation of the candidate lemma [A1A2…An]; P(A1A2…An) represents the probability of occurrence of the candidate lemma [A1A2…An] in all corpora; P(A1) and P(A2…An) respectively represent the probabilities of occurrence, in all corpora, of the character [A1] and the character string [A2…An] split from the candidate lemma [A1A2…An]; the occurrence probability is the ratio of the number of occurrences in all corpora to the total length of the corpora; and the value of n represents the number of characters or letters constituting the candidate lemma.
Optionally, the second computing module includes:
the second statistic submodule is configured to record adjacent words of the candidate lemmas and count the occurrence times of the adjacent words of the candidate lemmas in all the linguistic data;
a second selection submodule configured to sequentially select the plurality of candidate lemmas as target candidate lemmas;
a second calculating submodule configured to calculate occurrence probabilities of adjacent words of the target candidate lemma selected by the second selecting submodule in all the corpora;
a third calculation submodule configured to calculate an information entropy of the adjacent word based on the occurrence probability calculated by the second calculation submodule;
an adding sub-module configured to add the information entropies of all the adjacent words of the target candidate lemma calculated by the third calculation submodule, to obtain the degree of freedom of the target candidate lemma.
Optionally, the third computing module includes:
the weighting submodule is configured to perform weighting calculation on the aggregation degree and the freedom degree of each candidate lemma in the plurality of candidate lemmas calculated by the first calculation module and the second calculation module according to a preset weighting formula to obtain a weighted sum;
the preset weighting formula is as follows:
F(ω) = S(ω)*ω1 + I(ω)*ω2
or,
F(ω) = S(ω)*ω1 + I(ω)*ω2 + C(ω)*ω3
where F(ω) represents the weighted sum; S(ω) represents the degree of aggregation of the candidate lemma; ω1 represents the weight ratio preset for the degree of aggregation; I(ω) represents the degree of freedom of the candidate lemma; ω2 represents the weight ratio preset for the degree of freedom; C(ω) represents a correction parameter; and ω3 represents the weight ratio preset for the correction parameter.
Optionally, the correction parameter is the occurrence probability of the candidate lemma in all the corpora;
the third computing module further comprises:
the judging submodule is configured to judge whether the total length of the corpus is greater than a preset threshold value;
an increasing submodule configured to increase, by a preset amplitude, the weight proportion of the occurrence probability of the candidate lemma in all the corpora when the total length of the corpora is lower than the preset threshold;
and the reducing submodule is configured to reduce the weight proportion of the occurrence probability of the candidate lemma in all the corpuses based on a preset amplitude when the total length of the corpuses is greater than a preset threshold value.
Optionally, the extracting module includes:
the sorting submodule is configured to sort the weighted sum calculated by the third calculation module according to the numerical value;
an extraction sub-module configured to extract m candidate lemmas with the highest weighted sum among the plurality of candidate lemmas as candidate words or candidate phrases based on the ranking; wherein the value of m is set by a user.
Optionally, the extracting module further includes:
the output sub-module is configured to output the candidate words or candidate phrases extracted by the extraction sub-module on a manual review interface;
and the import sub-module is configured to import the candidate words or candidate word groups which are approved in the manual review interface into spelling suggestions or word segmentation lists of a search engine.
According to a third aspect of the embodiments of the present disclosure, there is also provided a new word extraction device, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
calculating the degree of aggregation of a plurality of candidate lemmas; the degree of aggregation characterizes the probability of the candidate lemma as a fixed word or a fixed phrase;
calculating degrees of freedom of the candidate lemmas; the degree of freedom characterizes the flexibility with which the candidate lemmas are collocated with fixed words or fixed phrases; a higher degree of freedom value indicates that the candidate lemma can be collocated with more fixed words or fixed phrases;
respectively carrying out weighted calculation on the calculated aggregation degree and the calculated freedom degree of each candidate lemma in the plurality of candidate lemmas to obtain weighted sums;
and extracting candidate words or candidate phrases from the candidate lemmas based on the calculated weighted sum.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
In the above embodiments of the present disclosure, the degree of aggregation of a plurality of candidate lemmas is calculated, where the degree of aggregation characterizes the probability of a candidate lemma being a fixed word or fixed phrase; the degrees of freedom of the candidate lemmas are calculated, where the degree of freedom characterizes the flexibility with which a candidate lemma is collocated with fixed words or fixed phrases, a higher value indicating that more fixed words or fixed phrases can be collocated with the candidate lemma; the calculated degree of aggregation and degree of freedom of each candidate lemma are then weighted to obtain a weighted sum; and candidate words or candidate phrases are extracted from the plurality of candidate lemmas based on the calculated weighted sums. In this way, new candidate words or candidate phrases can be extracted from the candidate lemmas more intelligently, and the extraction accuracy can be remarkably improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating a method of new word extraction, according to an example embodiment;
FIG. 2 is a flow diagram illustrating another method of extracting new words in accordance with an illustrative embodiment;
FIG. 3 is a schematic block diagram of a new word extraction apparatus shown in accordance with an exemplary embodiment;
FIG. 4 is a schematic block diagram of another new word extraction apparatus shown in accordance with an exemplary embodiment;
FIG. 5 is a schematic block diagram of another new word extraction apparatus shown in accordance with an exemplary embodiment;
FIG. 6 is a schematic block diagram of another new word extraction apparatus shown in accordance with an exemplary embodiment;
FIG. 7 is a schematic block diagram of another new word extraction apparatus shown in accordance with an exemplary embodiment;
FIG. 8 is a schematic block diagram of another new word extraction apparatus shown in accordance with an exemplary embodiment;
FIG. 9 is a schematic block diagram of another new word extraction apparatus shown in accordance with an exemplary embodiment;
FIG. 10 is a schematic block diagram of another new word extraction apparatus shown in accordance with an exemplary embodiment;
fig. 11 is a schematic structural diagram illustrating an apparatus for extracting new words according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "upon", "when", or "in response to a determination", depending on the context.
In the existing implementation, for word collocations which are not stored in the original word segmentation word list of the search engine, when the search engine splits the word collocations, the word collocations may not be split accurately, so that the search result does not meet the expectation of the user.
In order to solve the above problem, a common practice at present is to discover and extract the latest word collocations from a corpus in time according to a new word discovery strategy, and then import the extracted newly appearing word collocations into the word segmentation word list of a search engine.
In the related art, current new word discovery strategies are usually based on word frequency statistics: if several characters or words appear together as a collocation many times, they are likely a fixed word or fixed phrase. Therefore, the characters or words are ranked by their number of occurrences, the top N most frequent ones are extracted, words or phrases already present in the search engine's word segmentation word list are removed, and the remaining words or phrases are taken as new words or phrases.
However, the above new word discovery strategy considers neither the internal inseparability of a word, i.e., whether it is a common fixed collocation, nor whether the word can flexibly appear in different contexts. As a result, the accuracy of new word extraction is poor, and omissions or errors easily occur.
In view of the above, the present disclosure provides a new word extraction method, which calculates the degree of aggregation of a plurality of candidate lemmas, where the degree of aggregation characterizes the probability of a candidate lemma being a fixed word or fixed phrase; calculates the degrees of freedom of the candidate lemmas, where the degree of freedom characterizes the flexibility with which a candidate lemma is collocated with fixed words or fixed phrases, a higher value indicating that more fixed words or fixed phrases can be collocated with the candidate lemma; then weights the calculated degree of aggregation and degree of freedom of each candidate lemma to obtain a weighted sum; and extracts candidate words or candidate phrases from the plurality of candidate lemmas based on the calculated weighted sums.
As shown in fig. 1, fig. 1 is a new word extraction method according to an exemplary embodiment, where the new word extraction method is used in a server, and includes the following steps:
in step 101, calculating the aggregation of a plurality of candidate lemmas; the degree of aggregation characterizes a probability of the candidate token as a fixed word or a fixed phrase.
In step 102, calculating degrees of freedom of the candidate lemmas; the freedom degree characterizes the flexibility of matching the candidate word elements with fixed words or fixed word groups; the higher the degree of freedom value is, the more fixed words or fixed phrases the candidate lemma can be collocated with.
In step 103, the calculated aggregation and the calculated degree of freedom of each of the plurality of candidate lemmas are weighted to obtain a weighted sum.
In step 104, candidate words or candidate phrases are extracted from the candidate lemmas based on the calculated weighted sum.
The server side can comprise a server, a server cluster or a cloud platform constructed based on the server cluster, wherein the server provides search services for users. For example, in an application scenario of an e-commerce website, the server may be a server providing a commodity search service to users. The server can split the keywords input by the user through a word segmentation word list in the search engine, and searches and outputs matched commodities for the user in the commodities to be sold according to the split lemmas.
The technical solution of the present disclosure is described in detail below with an application scenario of an e-commerce website, in which the server is a server that provides a commodity search service for a user as an example.
In an application scenario of an e-commerce network, the candidate lemma may be a candidate word or a candidate phrase obtained by a server splitting the lemma for the obtained corpus.
The corpora may include commodity names and user search logs. When acquiring corpora, the server can periodically extract the names of newly listed commodities to be sold and the users' search logs.
For the commodity names, the server may periodically extract the commodity names under different commodity classifications and then write them into different texts by classification in a certain order; for example, each line in a text corresponds to one commodity name. For the user search logs, the server can record the search term entered in each of the user's search operations, merge duplicate search terms, and then write them into a text in a certain order; for example, there may still be one search term per line in the text.
In the present disclosure, the obtained corpus may be preprocessed by the server. When preprocessing the corpus, the server can split it into a plurality of candidate lemmas, and different splitting modes may be used for this lemma splitting.
In an embodiment shown in the present disclosure, the server may set a maximum splitting length and a minimum splitting length, both of which may be customized by the user; the maximum splitting length is the maximum length of a split lemma, and the minimum splitting length is the minimum length of a split lemma. The minimum splitting length can be set to 2, which guarantees that no single-character lemmas appear in the final splitting result.
When the minimum splitting length and the maximum splitting length are set, a length range of the lexeme splitting can be formed, and when the server splits the lexemes, the server can split the lexemes according to each splitting length value in the length range of the lexeme splitting so as to split the lexemes into a plurality of candidate lexemes.
For example, assume that the corpus acquired by the server includes the commodity name "millet mobile phone version", the maximum splitting length set by the user is 4, and the minimum splitting length is 2. When the server splits this commodity name into lemmas, it may first split it by the minimum splitting length 2 into the candidate lemmas "millet", "mobile phone" and "version", where the candidate lemma "version" may be discarded because its length is less than 2. After the name has been split by the minimum splitting length 2, it can be further split by splitting length 3 into "millet hand", "mobile phone" and "version". After the name has been split by splitting length 3, it can be further split by the maximum splitting length 4 into the two lemmas "millet mobile phone" and "mobile version". At this point the splitting of the commodity name is complete, and the finally split candidate lemmas include the 6 lemmas "millet", "mobile phone", "millet hand", "mobile phone", "millet mobile phone" and "mobile version".
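The splitting procedure above can be expressed as a short sketch. This is a minimal illustration under stated assumptions, not the patent's implementation: it treats the corpus line as a plain string, cuts it into consecutive non-overlapping chunks once per splitting length (as the worked example suggests), and discards trailing chunks shorter than the minimum length; all names are illustrative.

```python
def split_candidate_lemmas(text, min_len=2, max_len=4):
    """Cut `text` into consecutive chunks once per splitting length in
    [min_len, max_len]; remainder chunks shorter than min_len are
    dropped, like the discarded "version" fragment in the example."""
    candidates = []
    for n in range(min_len, max_len + 1):   # one pass per splitting length
        for i in range(0, len(text), n):
            chunk = text[i:i + n]
            if len(chunk) >= min_len:
                candidates.append(chunk)
    return list(dict.fromkeys(candidates))  # de-duplicate, keep order
```

For a seven-character Chinese commodity name this yields chunks of lengths 2, 3 and 4, essentially the candidate lemmas listed in the example above.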
After the server completes the splitting of the obtained corpus according to the above mode, a plurality of candidate lemmas can be obtained. At this time, the server may record the adjacent words of each candidate lemma respectively, and count the recorded adjacent words of each candidate lemma and the occurrence number of each candidate lemma in all the corpora.
The adjacent characters of each candidate lemma include left adjacent characters and right adjacent characters, and one candidate lemma may have a plurality of adjacent characters. For candidate lemmas that can be flexibly collocated with other fixed words or fixed phrases, the adjacent characters may be richer; for example, the candidate lemma "millet" may be collocated with "box" as the fixed phrase "millet box", or with "mobile phone" as the fixed phrase "millet mobile phone". Thus, the more words or phrases a candidate lemma can be collocated with, the more adjacent characters it has.
When recording the adjacent characters of each candidate lemma, the server can search for the lemma in the corpus from which it was split. For example, suppose the corpus is the commodity name "millet mobile phone version" and the candidate lemma is "mobile phone"; then the left adjacent character of the candidate lemma is the character "meter" and the right adjacent character is the character "move". Of course, for lemmas without left or right adjacent characters, such as the candidate lemmas "millet" and "mobile version", the server may mark the missing left or right adjacent character in the record.
When the server counts the number of occurrences of a candidate lemma or an adjacent character in all the corpora, the server can scan all the corpora using the candidate lemma or adjacent character as an index, incrementing the occurrence count by one for each match found. For example, assuming the candidate lemma is "millet", the server may scan the commodity names and user search logs in all the corpora using "millet" as the keyword, incrementing the occurrence count of the lemma by one for each match. Assuming the adjacent character is "m", the server can likewise scan the commodity names and user search logs in all the corpora using "m" as the keyword, incrementing the occurrence count by one for each match.
For the recorded adjacent words of each candidate lemma, the counted occurrence frequency of each candidate lemma and the adjacent words of each candidate lemma in all the linguistic data can be stored in the same text together with the candidate lemma by the server. For example, the server may store the candidate lemmas in text, where each line of the text corresponds to a respective candidate lemma. The server can store the recorded adjacent characters, the counted occurrence frequency of the candidate word element and the candidate word element in the same line, so that the server can conveniently read the candidate word element.
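As a rough sketch of this preprocessing bookkeeping (the patent does not prescribe data structures; the function and variable names are illustrative), the occurrence counts and the left/right adjacent characters of each candidate lemma can be gathered in one pass over the corpus:

```python
from collections import Counter, defaultdict

def preprocess(corpus_lines, candidates):
    """For every candidate lemma, count its occurrences in all corpus
    lines and record the characters immediately to its left and right."""
    counts = Counter()
    left = defaultdict(Counter)   # lemma -> Counter of left-adjacent chars
    right = defaultdict(Counter)  # lemma -> Counter of right-adjacent chars
    for line in corpus_lines:
        for lemma in candidates:
            pos = line.find(lemma)
            while pos != -1:
                counts[lemma] += 1
                if pos > 0:                      # a left neighbor exists
                    left[lemma][line[pos - 1]] += 1
                end = pos + len(lemma)
                if end < len(line):              # a right neighbor exists
                    right[lemma][line[end]] += 1
                pos = line.find(lemma, pos + 1)
    return counts, left, right
```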
The foregoing describes a preprocessing process of the server on the split candidate lemmas.
In this disclosure, after the server finishes the preprocessing process of the split candidate lemmas, the server may calculate the aggregation and the degree of freedom of each candidate lemma based on the counted parameters such as the number of occurrences of the candidate lemma, the adjacent words of the candidate lemma, and the number of occurrences of the adjacent words.
The degree of aggregation characterizes the probability of a candidate lemma being a fixed word or fixed phrase. This indicator can be used to evaluate whether a lemma is a common fixed collocation, i.e., whether the characters forming the lemma were pieced together by accident. For example, if character A is unrelated to character B, then the probability that they accidentally form the word AB is P(A) * P(B), where P(A) represents the probability that A appears in the corpus. If P(AB) is calculated to be much greater than P(A) * P(B), then it can be assumed that the word AB is not a random combination of the characters A and B, but a fixed word collocation.
Based on this, in one embodiment shown, the server, when calculating the degree of aggregation of the candidate lemmas, may perform the calculation by the following degree of aggregation calculation formula:
S(A1A2…An) = min{ P(A1A2…An) / (P(A1) * P(A2…An)), P(A1A2…An) / (P(A1A2) * P(A3…An)), …, P(A1A2…An) / (P(A1…An-1) * P(An)) }
In the above formula, S(A1A2…An) represents the degree of aggregation of the candidate lemma [A1A2…An], and P(A1A2…An) represents the probability of occurrence of the candidate lemma [A1A2…An] in all corpora. Accordingly, P(A1) and P(A2…An) respectively represent the probabilities of occurrence, in all corpora, of the character [A1] and the character string [A2…An] split from the candidate lemma [A1A2…An]; by analogy, P(A1A2) and P(A3…An) respectively represent the probabilities of occurrence, in all corpora, of the character string [A1A2] and the character string [A3…An] split from the candidate lemma [A1A2…An].
In the present disclosure, the occurrence probability may be expressed as the ratio of the number of occurrences in all corpora to the total length of the corpora. For example, when calculating the occurrence probability of a candidate lemma, the server may use the ratio of the number of occurrences of the candidate lemma, obtained by statistics in the corpus preprocessing stage, to the total length of the corpora; when calculating the occurrence probability of an adjacent character, the server may likewise use the ratio of the number of occurrences of the adjacent character counted in the preprocessing stage to the total length of the corpora. The value of n in the above formula represents the number of characters or letters of the candidate lemma; for example, when the candidate lemma is "millet", the value of n is 2.
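Expressed as code, the occurrence probability used throughout is just this ratio. A minimal helper, assuming `counts` is the occurrence table from the preprocessing sketch and that "total length" means the total character count of all corpora:

```python
def make_prob(counts, total_corpus_length):
    """Occurrence probability: number of occurrences in all corpora
    divided by the total length of the corpora."""
    def prob(s):
        return counts.get(s, 0) / total_corpus_length
    return prob
```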
When the server calculates the aggregation of the candidate lemmas by using the above formula, the server may sequentially select all the candidate lemmas obtained by splitting as target candidate lemmas, and then calculate the aggregation of all the candidate lemmas obtained after splitting by using the above formula for each selected target candidate lemma.
When the server calculates the degree of aggregation of a target candidate lemma, the server can first further split the candidate lemma. Taking the candidate lemma [A1A2A3A4] as an example, the server may split it into the 6 combinations [A1] and [A2A3A4], [A1A2] and [A3A4], and [A1A2A3] and [A4].
When the splitting is complete, the server can compute the probabilities of occurrence of [A1], [A2A3A4], [A1A2], [A3A4], [A1A2A3] and [A4] in all corpora. The specific process of calculating the occurrence probabilities of [A1], [A2A3A4], [A1A2], [A3A4], [A1A2A3] and [A4] split from the candidate lemma [A1A2A3A4] is the same as the specific process of calculating the occurrence probability of the candidate lemma [A1A2A3A4] itself. For example, when calculating the occurrence probability P(A1) of [A1], the server can count the occurrences of [A1] and compute the ratio of the counted occurrences to the total length of the corpora.
After the server has calculated the occurrence probabilities of the 6 combinations [A1] and [A2A3A4], [A1A2] and [A3A4], and [A1A2A3] and [A4], the calculated occurrence probabilities can be substituted into the above formula to obtain the degree of aggregation of the candidate lemma [A1A2A3A4].
For example, assuming the candidate lemma is "millet cell phone", the server may split it into the 6 combinations "small", "rice cell phone", "millet", "cell phone", "millet hand" and "machine", then calculate the occurrence probabilities of these six combinations in all corpora, and substitute the calculated occurrence probabilities into the above formula. When the calculated occurrence probabilities of the 6 combinations are substituted into the formula, 3 groups of values are obtained.
The value of group 1 is the ratio of the occurrence probability of the candidate lemma "millet cell phone" to the product of the occurrence probabilities of "small" and "rice cell phone".
The value of group 2 is the ratio of the occurrence probability of the candidate lemma "millet cell phone" to the product of the occurrence probabilities of "millet" and "cell phone".
The value of group 3 is the ratio of the occurrence probability of the candidate lemma "millet cell phone" to the product of the occurrence probabilities of "millet hand" and "machine".
Based on the above formula, the minimum of the 3 groups of values is taken as the degree of aggregation of the candidate lemma "millet cell phone".
It should be noted that, in implementation, the maximum value among the 3 groups of values may also be used as the degree of aggregation of the candidate lemma "millet cell phone"; this is not specifically limited in the present disclosure.
Therefore, in another embodiment shown, the above formula for calculating the degree of aggregation may also be expressed as the following formula, and the specific calculation process is not described again:
S(A1A2…An) = max{ P(A1A2…An) / (P(A1) * P(A2…An)), P(A1A2…An) / (P(A1A2) * P(A3…An)), …, P(A1A2…An) / (P(A1…An-1) * P(An)) }
the above describes the calculation process of the degree of aggregation of the candidate lemmas, and in this way, the degree of aggregation of each candidate lemma can be calculated.
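A compact sketch of this aggregation computation, assuming `prob` is the occurrence-probability helper from the sketch above (an illustrative function, not named in the patent); substituting max for min gives the alternative formula:

```python
def aggregation_degree(lemma, prob, use_max=False):
    """min (or max), over all binary split points, of
    P(lemma) / (P(prefix) * P(suffix))."""
    p_whole = prob(lemma)
    ratios = []
    for i in range(1, len(lemma)):        # every prefix/suffix split point
        denom = prob(lemma[:i]) * prob(lemma[i:])
        if denom > 0:
            ratios.append(p_whole / denom)
    if not ratios:
        return 0.0
    return max(ratios) if use_max else min(ratios)
```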
In the present disclosure, after the server calculates the degree of aggregation of each candidate lemma, the server may calculate the degree of freedom of each candidate lemma based on the parameters of the adjacent word of the candidate lemma that have been counted, the number of occurrences of the adjacent word, and the like.
The degree of freedom characterizes the flexibility with which a candidate lemma is collocated with fixed words or fixed phrases. This index measures whether a candidate lemma can flexibly appear in various different contexts, i.e., whether its left and right collocations are flexible. If the degree of freedom value of a candidate lemma is low, the lemma may be only a part of some fixed word collocation; the higher the degree of freedom value, the more fixed words or fixed phrases can be collocated with the candidate lemma. For example, if the candidate lemma "millet" appears 363 times in all commodity names and 17 different words or phrases can be collocated on its right side, the degree of freedom of the word "millet" can be considered high.
In this disclosure, when calculating the degrees of freedom of the candidate lemmas, the server may sequentially select all the candidate lemmas obtained by splitting as target candidate lemmas, and then calculate the degrees of freedom for each selected target candidate lemma, so that the degrees of freedom of all the candidate lemmas obtained after splitting may be calculated.
When the server calculates the degree of freedom of the target candidate lemma, the server may first read the neighboring words of the target candidate lemma recorded in the corpus preprocessing stage, at this time, the target candidate lemma may have a plurality of neighboring words, and the server may calculate the information entropy of each neighboring word of the target candidate lemma respectively.
When the information entropy of the adjacent word is obtained, the information entropy can be obtained by the following calculation formula of the information entropy:
I(ω) = -Σ P * log(P)
In the above formula, I(ω) represents the calculated information entropy, and P represents the probability of occurrence of an adjacent character of the target candidate lemma in all the corpora. The base of the log function in the above formula is not specifically limited in this disclosure; it may be 2 or another value, and those skilled in the art may refer to descriptions in the prior art.
After the server calculates the information entropies of all the adjacent characters of the target candidate lemma through the above formula, the information entropies can be added together, and the sum is used as the degree of freedom of the target candidate lemma. It can be seen that the more adjacent characters a target candidate lemma has, the higher its finally calculated degree of freedom, and the more fixed words or fixed phrases can be collocated with it.
The above describes the calculation process of the degrees of freedom of the candidate lemmas, and in this way, the degree of freedom of each candidate lemma can be calculated.
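The degree-of-freedom computation can be sketched as below. One point the text leaves open is whether P is the neighbor's probability within that lemma's own neighbor distribution or its probability over the whole corpus; this sketch uses the per-lemma distribution, the common choice for neighbor entropy, and that reading should be treated as an assumption.

```python
import math

def degree_of_freedom(lemma, left, right):
    """Sum the information entropies I = -sum(P * log P) of the lemma's
    left- and right-adjacent character distributions."""
    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log(c / total)
                    for c in counter.values())
    return entropy(left[lemma]) + entropy(right[lemma])
```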
In the present disclosure, after the server has calculated the degree of aggregation and the degree of freedom of every split candidate lemma, the server may weight the degree of aggregation and the degree of freedom of each candidate lemma to obtain weighted sums, and then extract new candidate words or candidate phrases from all the candidate lemmas according to the calculated weighted sums.
In an embodiment shown, when the server performs weighting calculation on the aggregation degree and the degree of freedom of each candidate lemma respectively, the weighting calculation may be performed by the following weighting formula:
F(ω) = S(ω)*ω1 + I(ω)*ω2
where F(ω) represents the weighted sum; S(ω) represents the degree of aggregation of the candidate lemma; ω1 represents the weight ratio preset for the degree of aggregation; I(ω) represents the degree of freedom of the candidate lemma; and ω2 represents the weight ratio preset for the degree of freedom. The specific values of ω1 and ω2 in the above formula can be customized or adjusted by the user according to actual requirements.
In the present disclosure, after the server calculates the weighted sums of all candidate lemmas through the above weighting formula, the calculated weighted sums can be sorted by value, and the server may then extract, based on the sorting, the m candidate lemmas with the highest weighted sums as new candidate words or new candidate phrases. The value of m may be set by the user according to actual requirements; for example, the 10 candidate lemmas with the highest weighted sums may be extracted.
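Putting the two-term weighting and the top-m selection together (a sketch: S and I map each candidate lemma to its degree of aggregation and degree of freedom, and the weights and m are user-set values given illustrative defaults here):

```python
def extract_new_words(S, I, w1=0.5, w2=0.5, m=10):
    """F = S*w1 + I*w2 per candidate lemma; return the m lemmas with
    the highest weighted sums as new candidate words or phrases."""
    f = {lemma: S[lemma] * w1 + I[lemma] * w2 for lemma in S}
    return sorted(f, key=f.get, reverse=True)[:m]
```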
In the above embodiment, weighting with the two indexes of degree of aggregation and degree of freedom can better measure the inseparability and collocation flexibility of candidate lemmas. However, the commodity names and user search logs of an e-commerce website contain many special brand names and proper nouns, so using only the degree of aggregation and degree of freedom may miss words that appear relatively frequently but are easily split or have a low degree of freedom; such words may well be proper nouns of the e-commerce website.
Therefore, in the present disclosure, to make the extraction of candidate words or candidate phrases more accurate, when performing the above weighted calculation, in addition to the two indexes of degree of aggregation and degree of freedom, the occurrence probability of the candidate lemma may be introduced into the weighted calculation as a correction parameter. This avoids missing words that appear relatively frequently but are easily split or have a low degree of freedom when extracting new words from the weighting result.
In another embodiment shown in the present disclosure, when the server performs weighting calculation on the aggregation degree and the degree of freedom of each candidate lemma, the weighting calculation may be performed by the following weighting formula:
F(ω) = S(ω)*ω1 + I(ω)*ω2 + C(ω)*ω3
In the above formula, F(ω) represents the weighted sum; S(ω) represents the degree of aggregation of the candidate lemma; ω1 represents the weight ratio preset for the degree of aggregation; I(ω) represents the degree of freedom of the candidate lemma; ω2 represents the weight ratio preset for the degree of freedom; C(ω) represents the correction parameter; and ω3 represents the weight ratio preset for the correction parameter.
In the present disclosure, when the occurrence probability of the candidate lemma is used as the correction parameter, C(ω) in the above formula represents the occurrence probability of the candidate lemma, and ω3 represents the weight ratio set by the user for that occurrence probability.
It should be noted that the weight ratios of the degree of aggregation, the degree of freedom, and the occurrence probability of the candidate lemma in the above formula can all be set and adjusted by the user according to actual needs. In practical application, the total length of the corpus used by the server may also influence the result of the weighted calculation; for example, when the total length of the corpus is insufficient, the calculated degree of aggregation and degree of freedom of the candidate lemmas may be inaccurate. In such a case, using the occurrence probability of the candidate lemma as the correction parameter and appropriately adjusting its weight ratio can correct the finally calculated weighted sum.
In this disclosure, when adjusting the weight ratio of the occurrence probability of the candidate lemma, the server may compare the total length of the corpus with a preset threshold to determine whether the current total length is greater than the threshold. If the total length of the corpus is greater than the preset threshold, the corpus is sufficient and the degree of aggregation and degree of freedom calculated by the server are generally accurate, so the server may appropriately reduce the weight ratio of the occurrence probability of the candidate lemma by a preset amplitude. Likewise, if the total length of the corpus is less than the preset threshold, the corpus may be insufficient and the calculated degree of aggregation and degree of freedom are usually inaccurate, so the server can appropriately increase the weight ratio of the occurrence probability of the candidate lemma by a preset amplitude. This helps offset the inaccuracy in the degree of aggregation and degree of freedom caused by the insufficient corpus, and corrects the finally calculated weighted sum to the maximum extent.
The preset amplitude may be set according to how far the total corpus length exceeds or falls below the preset threshold: the larger the deviation, the larger the adjustment amplitude, and conversely, the smaller the deviation, the smaller the adjustment amplitude.
For example, the server may divide the amount by which the total corpus length deviates from the threshold into a plurality of levels and set a corresponding adjustment amplitude for each level, where a larger deviation corresponds to a larger adjustment amplitude. When determining the adjustment amplitude, the deviation of the total corpus length from the preset threshold is matched against these levels, and if a level is matched, the weight ratio of the occurrence probability of the candidate lemma is adjusted by the amplitude corresponding to that level.
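A sketch of the three-term variant with the threshold-based adjustment of ω3. The level boundaries and step sizes below are purely illustrative assumptions; the patent fixes only the direction of the adjustment (raise ω3 below the threshold, lower it above) and that larger deviations warrant larger amplitudes.

```python
def adjusted_w3(w3, corpus_length, threshold,
                levels=((2.0, 0.10), (1.5, 0.05), (1.0, 0.02))):
    """Increase the correction-parameter weight when the corpus is
    shorter than the threshold, decrease it when longer; the step
    grows with the deviation (levels are ordered largest first)."""
    if corpus_length == threshold:
        return w3
    deviation = max(corpus_length, threshold) / min(corpus_length, threshold)
    for bound, step in levels:
        if deviation >= bound:
            return w3 - step if corpus_length > threshold else w3 + step
    return w3

def weighted_sum(s, i, c, w1, w2, w3):
    """F = S*w1 + I*w2 + C*w3, where C is the lemma's occurrence
    probability used as the correction parameter."""
    return s * w1 + i * w2 + c * w3
```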
After the server calculates the weighted sum of all candidate lemmas through the weighting formula, the server can still rank the weighted sum of all the candidate lemmas obtained through calculation according to the numerical value, and then the server can extract m candidate lemmas with the highest weighted sum of all the candidate lemmas as new candidate words or new candidate phrases based on the ranking, and the specific process is not repeated.
In the disclosure, after the server extracts new candidate words or candidate phrases from all the candidate lemmas, the extracted candidate words or candidate phrases may be output in a manual review interface, so that a reviewer performs manual review.
In the manual review page, each output candidate word or candidate phrase can be set to a "pass" or "reject" review state, respectively. When an output candidate word or candidate phrase is approved by the reviewer, it can be set to the "pass" state. In addition, the manual review page may provide a user option for setting an import directory for the candidate words or candidate phrases that have passed review, and the reviewer may set the import directory by operating this user option. For example, for the search server, approved candidate words or candidate phrases may be imported into the word segmentation word list or the spelling suggestions of the search engine, so the user option in the manual review interface may provide two import directories: word segmentation word list and spelling suggestions.
For the candidate words or candidate phrases whose review is complete, the server can periodically read the candidate words or candidate phrases set to the "pass" state in the manual review interface, read the import directory set by the reviewer, and import the read candidate words or candidate phrases into the corresponding directory. In this way, when a user later searches with keywords containing an imported candidate word or candidate phrase, the server can correctly segment the keywords through the word segmentation word list and output matching search results for the user; and while the user is entering keywords, the server can also output the imported candidate words or candidate phrases to the user as spelling suggestions, thereby improving the user's search experience.
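As a final illustrative sketch (the file formats and the shape of the review records are assumptions, not from the patent), the periodic import step might read the reviewed items and append the approved ones to the directory the reviewer chose:

```python
def import_approved(reviewed_items, segmentation_vocab_path,
                    spelling_suggestions_path):
    """Append each candidate marked "pass" to its import directory:
    the word segmentation word list or the spelling suggestions.
    `reviewed_items` is an iterable of (candidate, state, target)
    tuples, e.g. ("millet mobile phone", "pass", "segmentation")."""
    with open(segmentation_vocab_path, "a", encoding="utf-8") as seg, \
         open(spelling_suggestions_path, "a", encoding="utf-8") as sug:
        for candidate, state, target in reviewed_items:
            if state != "pass":
                continue
            out = seg if target == "segmentation" else sug
            out.write(candidate + "\n")
```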
Therefore, through the method, the server can continuously acquire the linguistic data, and accurately extract new candidate words or candidate phrases from the linguistic data to enrich the word segmentation word list and spelling suggestions of the search engine, so that the search experience of a user can be continuously optimized, and the accuracy of a search result is improved.
In the above embodiments, the technical solution of the present disclosure is described by taking an application scenario of an e-commerce website as an example, and in practical application, the technical solution of the present disclosure may also be applied to other scenarios. When a person skilled in the art puts the above technical solution into practice in other scenarios, reference may be made to the descriptions in the above embodiments of the present disclosure, and details are not described in the present disclosure.
In the above embodiment, the degree of aggregation of a plurality of candidate lemmas is calculated, where the degree of aggregation characterizes the probability of a candidate lemma being a fixed word or fixed phrase; the degrees of freedom of the candidate lemmas are calculated, where the degree of freedom characterizes the flexibility with which a candidate lemma is collocated with fixed words or fixed phrases, a higher value indicating that more fixed words or fixed phrases can be collocated with the candidate lemma; the calculated degree of aggregation and degree of freedom of each candidate lemma are then weighted to obtain a weighted sum; and candidate words or candidate phrases are extracted from the plurality of candidate lemmas based on the calculated weighted sums.
As shown in fig. 2, fig. 2 is another new word extraction method according to an exemplary embodiment, where the new word extraction method is used in a server and includes the following steps:
in step 201, corpora are obtained.
In step 202, the obtained corpus is subjected to lemma splitting to obtain the multiple candidate lemmas.
In step 203, the occurrence frequency of the candidate lemmas in all corpora is counted.
In step 204, the candidate lemmas are sequentially selected as target candidate lemmas, and the degree of aggregation of the selected target candidate lemmas is calculated according to an aggregation calculation formula.
Wherein, the calculation formula of the degree of aggregation is as follows:
S(A1A2…An) = min{ P(A1A2…An) / (P(A1) * P(A2…An)), P(A1A2…An) / (P(A1A2) * P(A3…An)), …, P(A1A2…An) / (P(A1…An-1) * P(An)) }
or,
S(A1A2…An) = max{ P(A1A2…An) / (P(A1) * P(A2…An)), P(A1A2…An) / (P(A1A2) * P(A3…An)), …, P(A1A2…An) / (P(A1…An-1) * P(An)) }
where S(A1A2…An) represents the degree of aggregation of the candidate lemma [A1A2…An]; P(A1A2…An) represents the probability of occurrence of the candidate lemma [A1A2…An] in all corpora; P(A1) and P(A2…An) respectively represent the probabilities of occurrence, in all corpora, of the character [A1] and the character string [A2…An] split from the candidate lemma [A1A2…An]; the occurrence probability is the ratio of the number of occurrences in all corpora to the total length of the corpora; and the value of n represents the number of characters or letters constituting the candidate lemma.
In step 205, the adjacent words of the candidate lemmas are recorded, and the occurrence frequency of the adjacent words of the candidate lemmas in all the corpora is counted.
In step 206, the candidate lemmas are sequentially selected as target candidate lemmas, and the occurrence probability of the adjacent words of the selected target candidate lemmas in all the corpora is calculated.
In step 207, the information entropy of the adjacent word is calculated based on the calculated occurrence probability.
In step 208, the information entropies of all the adjacent words of the target candidate lemma calculated are added to obtain the degree of freedom of the target candidate lemma.
In step 209, the aggregation and the degree of freedom of each candidate lemma in the plurality of candidate lemmas are weighted according to a preset weighting formula to obtain a weighted sum.
The preset weighting formula is as follows:
$$F(\omega) = S(\omega) \cdot \omega_1 + I(\omega) \cdot \omega_2$$

or,

$$F(\omega) = S(\omega) \cdot \omega_1 + I(\omega) \cdot \omega_2 + C(\omega) \cdot \omega_3$$

wherein $F(\omega)$ represents the weighted sum; $S(\omega)$ represents the degree of aggregation of the candidate lemma; $\omega_1$ represents the weight ratio preset for the degree of aggregation; $I(\omega)$ represents the degree of freedom of the candidate lemma; $\omega_2$ represents the weight ratio preset for the degree of freedom; $C(\omega)$ represents a correction parameter; $\omega_3$ represents the weight ratio preset for the correction parameter.
In step 210, the weighted sum obtained by calculation is sorted according to the magnitude of the value, and m candidate lemmas with the highest weighted sum among the candidate lemmas are extracted as candidate words or candidate phrases based on the sorting.
In step 211, the extracted candidate words or candidate phrases are output in a manual review interface, and the candidate words or candidate phrases that have been reviewed in the manual review interface are imported into a spelling suggestion or word segmentation vocabulary of a search engine.
The server side can comprise a server, a server cluster or a cloud platform constructed based on the server cluster, wherein the server provides search services for users. For example, in an application scenario of an e-commerce website, the server may be a server providing a commodity search service to users. The server can split the keywords input by the user through a word segmentation word list in the search engine, and searches and outputs matched commodities for the user in the commodities to be sold according to the split lemmas.
The technical solution of the present disclosure is described in detail below by taking as an example an application scenario of an e-commerce website in which the server provides a commodity search service for users.
In this e-commerce application scenario, a candidate lemma may be a candidate word or candidate phrase obtained by the server performing lemma splitting on the acquired corpus.
The corpus may include commodity names and user search logs. When acquiring the corpus, the server can periodically extract the names of newly listed commodities to be sold and the users' search logs.
For commodity names, the server may periodically extract the commodity names under different commodity classifications and write them, grouped by classification, into different texts in a certain order, for example with each line of a text corresponding to one commodity name. For the user search logs, the server can record the search term entered in each search operation, merge duplicate search terms, and then write them into a text in a certain order; for example, there may still be one search term per line in the text.
In the present disclosure, the obtained corpus may be preprocessed by the server. When preprocessing the corpus, the server can split the obtained corpus into a plurality of candidate lemmas, and different splitting modes may be adopted when performing lemma splitting on the corpus.
In an embodiment shown in the present disclosure, the server may set a maximum splitting length and a minimum splitting length, both of which may be customized by a user; the maximum splitting length is the maximum length of a split lemma, and the minimum splitting length is the minimum length of a split lemma. The minimum splitting length can be set to 2, which guarantees that no single characters remain among the finally split lemmas.
Setting the minimum and maximum splitting lengths defines a length range for lemma splitting; when splitting, the server can split the corpus according to each splitting-length value within this range, so as to split the corpus into a plurality of candidate lemmas.
For example, assume that the corpus acquired by the server includes the commodity name "millet mobile phone version", that the maximum splitting length set by the user is 4, and that the minimum splitting length is 2. When performing lemma splitting on this commodity name, the server may first split it according to the minimum splitting length 2 into candidate lemmas such as "millet" and "mobile phone", where a remaining lemma "version" is discarded because its length is less than 2. After splitting according to the splitting length 2, the commodity name can be further split according to the splitting length 3 into lemmas such as "millet hand". After splitting according to the splitting length 3, the commodity name can be further split according to the maximum splitting length 4 into the two lemmas "millet mobile phone" and "mobile version". At this point the splitting of the commodity name is complete, and the finally split candidate lemmas include 6 candidate lemmas such as "millet", "mobile phone", "millet hand", "millet mobile phone" and "mobile version".
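To make the splitting procedure above concrete, here is a minimal Python sketch (not from the patent; the function name, variable names and example string are invented). It enumerates every substring whose length falls within the user-set range, a sliding-window reading of the scheme described above; the patent itself does not pin down whether windows overlap, so this is one plausible interpretation.

```python
def split_lemmas(text: str, min_len: int = 2, max_len: int = 4) -> list[str]:
    """Enumerate candidate lemmas: every substring of `text` whose length
    lies in [min_len, max_len]; anything shorter than min_len (e.g. a
    single character) never becomes a candidate."""
    candidates = []
    for length in range(min_len, max_len + 1):
        for start in range(len(text) - length + 1):
            candidates.append(text[start:start + length])
    return candidates

# split_lemmas("ABCDE") -> ['AB', 'BC', 'CD', 'DE',
#                           'ABC', 'BCD', 'CDE', 'ABCD', 'BCDE']
```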
After the server completes the splitting of the obtained corpus in the above manner, a plurality of candidate lemmas are obtained. At this point, the server may record the adjacent characters of each candidate lemma and count the occurrence numbers, in all corpora, of each candidate lemma and of its recorded adjacent characters.
The adjacent characters of each candidate lemma include left adjacent characters and right adjacent characters, and one candidate lemma may have a plurality of them. For a candidate lemma that can be flexibly collocated with other fixed words or fixed phrases, the adjacent characters tend to be richer; for example, the candidate lemma "millet" may be collocated with "box" to form the fixed phrase "millet box", or with "mobile phone" to form the fixed phrase "millet mobile phone". Thus, the more fixed words or phrases a candidate lemma can be collocated with, the richer its adjacent characters.
When recording the adjacent characters of each candidate lemma, the server can look back into the corpus to which the candidate lemma belongs. For example, suppose the corpus is the commodity name "millet mobile phone version" and the candidate lemma is "mobile phone"; then the left adjacent character of the candidate lemma is "meter" and the right adjacent character is "move". Of course, for lemmas without a left or right adjacent character, such as the candidate lemmas "millet" and "mobile version", the server may mark in the record that no adjacent character exists on that side.
When counting the occurrence number of a candidate lemma or adjacent character in all corpora, the server can scan all corpora using the candidate lemma or adjacent character as an index, incrementing the count by one for each occurrence scanned. For example, if the candidate lemma is "millet", the server may scan the commodity names and user search logs in all corpora with "millet" as the keyword, incrementing the lemma's occurrence count by one each time it is found; if the adjacent character is "meter", the server can likewise scan all corpora with "meter" as the keyword and increment its occurrence count by one each time it is found.
For the recorded adjacent characters of each candidate lemma, the server can store the counted occurrence numbers of each candidate lemma and of its adjacent characters in the same text as the candidate lemma itself. For example, the server may store the candidate lemmas in a text with each line corresponding to one candidate lemma; the recorded adjacent characters and the counted occurrence number can be stored on the same line as the candidate lemma, which makes it convenient for the server to read.
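As an illustration of this preprocessing pass, the following sketch (illustrative names; it assumes the corpus is simply a list of strings, one commodity name or search term per entry) counts each candidate lemma and records its adjacent characters:

```python
from collections import Counter, defaultdict

def preprocess(corpus, min_len=2, max_len=4):
    """Count each candidate lemma's occurrences across the corpus and
    record its left/right adjacent characters; a lemma at a string
    boundary simply has no neighbour recorded on that side."""
    lemma_counts = Counter()
    left_chars = defaultdict(Counter)   # lemma -> left-neighbour counts
    right_chars = defaultdict(Counter)  # lemma -> right-neighbour counts
    total_length = 0
    for line in corpus:
        total_length += len(line)
        for length in range(min_len, max_len + 1):
            for i in range(len(line) - length + 1):
                lemma = line[i:i + length]
                lemma_counts[lemma] += 1
                if i > 0:
                    left_chars[lemma][line[i - 1]] += 1
                if i + length < len(line):
                    right_chars[lemma][line[i + length]] += 1
    return lemma_counts, left_chars, right_chars, total_length
```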
The foregoing describes a preprocessing process of the server on the split candidate lemmas.
In this disclosure, after the server finishes the preprocessing process of the split candidate lemmas, the server may calculate the aggregation and the degree of freedom of each candidate lemma based on the counted parameters such as the number of occurrences of the candidate lemma, the adjacent words of the candidate lemma, and the number of occurrences of the adjacent words.
The degree of aggregation characterizes the probability that a candidate lemma is a fixed word or fixed phrase. This indicator can be used to evaluate whether a lemma is a common fixed collocation, that is, whether the characters forming the lemma are merely pieced together by accident. For example, if character A is unrelated to character B, the probability that they accidentally form the word AB is P(A) × P(B), where P(A) represents the probability that A appears in the corpus. If the computed P(AB) is much greater than P(A) × P(B), it can be concluded that the word AB is not a random juxtaposition of characters A and B, but a fixed word collocation.
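To make this concrete with invented numbers: if P(A) = 0.01 and P(B) = 0.02, independence would predict P(AB) ≈ 0.01 × 0.02 = 0.0002; if the observed P(AB) is 0.005, the ratio P(AB)/(P(A) × P(B)) = 25, far above 1, which suggests that AB is a fixed collocation rather than an accidental adjacency.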
Based on this, in one embodiment shown, the server, when calculating the degree of aggregation of the candidate lemmas, may perform the calculation by the following degree of aggregation calculation formula:
$$S(A_1 A_2 \ldots A_n) = \min\left\{ \frac{P(A_1 A_2 \ldots A_n)}{P(A_1) \cdot P(A_2 \ldots A_n)},\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 A_2) \cdot P(A_3 \ldots A_n)},\; \ldots,\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 \ldots A_{n-1}) \cdot P(A_n)} \right\}$$

In the above formula, $S(A_1 A_2 \ldots A_n)$ represents the degree of aggregation of the candidate lemma $[A_1 A_2 \ldots A_n]$, and $P(A_1 A_2 \ldots A_n)$ represents the probability that the candidate lemma occurs in all corpora. Accordingly, $P(A_1)$ and $P(A_2 \ldots A_n)$ respectively represent the probabilities that the character $[A_1]$ and the character string $[A_2 \ldots A_n]$ split from the candidate lemma occur in all corpora; by analogy, $P(A_1 A_2)$ and $P(A_3 \ldots A_n)$ respectively represent the probabilities that the character strings $[A_1 A_2]$ and $[A_3 \ldots A_n]$ split from the candidate lemma occur in all corpora.
In the present disclosure, an occurrence probability may be expressed as the ratio of the number of occurrences in all corpora to the total corpus length. For example, when calculating the occurrence probability of a candidate lemma, the server may use the ratio of the candidate lemma's occurrence number, counted in the corpus preprocessing stage, to the total corpus length; when calculating the occurrence probability of an adjacent character, the server may likewise use the ratio of the adjacent character's counted occurrence number to the total corpus length. The value of $n$ in the above formula represents the number of characters or letters of the candidate lemma; for example, if the candidate lemma is "millet", the value of $n$ is 2.
When the server calculates the aggregation of the candidate lemmas by using the above formula, the server may sequentially select all the candidate lemmas obtained by splitting as target candidate lemmas, and then calculate the aggregation of all the candidate lemmas obtained after splitting by using the above formula for each selected target candidate lemma.
When calculating the degree of aggregation of a target candidate lemma, the server can first continue to split the candidate lemma. Taking the candidate lemma $[A_1 A_2 A_3 A_4]$ as an example, the server may split it into three combinations: $[A_1]$ and $[A_2 A_3 A_4]$; $[A_1 A_2]$ and $[A_3 A_4]$; and $[A_1 A_2 A_3]$ and $[A_4]$, yielding 6 segments in total.
When the splitting is complete, the server can compute the occurrence probabilities of $[A_1]$, $[A_2 A_3 A_4]$, $[A_1 A_2]$, $[A_3 A_4]$, $[A_1 A_2 A_3]$ and $[A_4]$ in all corpora. The process of calculating the occurrence probability of each segment split out of the candidate lemma $[A_1 A_2 A_3 A_4]$ is the same as the process of calculating the occurrence probability of the candidate lemma itself; for example, to obtain the occurrence probability $P(A_1)$ of $[A_1]$, the server can count the occurrences of $[A_1]$ and compute the ratio of the counted occurrences to the total corpus length.
After the server has calculated the occurrence probabilities of the 6 segments in the splits $[A_1]$ and $[A_2 A_3 A_4]$, $[A_1 A_2]$ and $[A_3 A_4]$, and $[A_1 A_2 A_3]$ and $[A_4]$, the calculated occurrence probabilities can be substituted into the above formula to obtain the degree of aggregation of the candidate lemma $[A_1 A_2 A_3 A_4]$.
For example, assuming that the candidate lemma is "millet mobile phone", the server may split the candidate lemma into the 6 segments "small", "rice mobile phone", "millet", "mobile phone", "millet hand" and "machine", then calculate the occurrence probabilities of these six segments in all corpora and substitute the calculated occurrence probabilities into the above formula. When the calculated occurrence probabilities of the 6 segments are substituted into the formula, 3 groups of values are obtained.
The 1st value is the ratio of the occurrence probability of the candidate lemma "millet mobile phone" to the product of the occurrence probabilities of "small" and "rice mobile phone".
The 2nd value is the ratio of the occurrence probability of the candidate lemma "millet mobile phone" to the product of the occurrence probabilities of "millet" and "mobile phone".
The 3rd value is the ratio of the occurrence probability of the candidate lemma "millet mobile phone" to the product of the occurrence probabilities of "millet hand" and "machine".
The minimum value of the 3 sets of values can be taken as the degree of aggregation of the candidate lemma "millet mobile phone" based on the above formula.
Note that, in implementation, the maximum value among the 3 groups of values may instead be used as the degree of aggregation of the candidate lemma "millet mobile phone"; this is not particularly limited in the present disclosure.
Therefore, in another embodiment shown, the above formula for calculating the degree of aggregation may also be expressed as the following formula, the specific calculation process not being repeated here:

$$S(A_1 A_2 \ldots A_n) = \max\left\{ \frac{P(A_1 A_2 \ldots A_n)}{P(A_1) \cdot P(A_2 \ldots A_n)},\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 A_2) \cdot P(A_3 \ldots A_n)},\; \ldots,\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 \ldots A_{n-1}) \cdot P(A_n)} \right\}$$
the above describes the calculation process of the degree of aggregation of the candidate lemmas, and in this way, the degree of aggregation of each candidate lemma can be calculated.
In the present disclosure, after the server calculates the degree of aggregation of each candidate lemma, the server may calculate the degree of freedom of each candidate lemma based on the parameters of the adjacent word of the candidate lemma that have been counted, the number of occurrences of the adjacent word, and the like.
The degree of freedom characterizes the flexibility with which a candidate lemma collocates with fixed words or fixed phrases. This index measures whether a candidate lemma can appear flexibly in a variety of different contexts, that is, whether its left and right collocations are flexible. If the degree-of-freedom value of a candidate lemma is low, the lemma is probably only part of some fixed word collocation; the higher the degree-of-freedom value, the more fixed words or fixed phrases can be collocated with the candidate lemma. For example, if the candidate lemma "millet" appears 363 times across all commodity names and 17 different words or phrases can be collocated on its right side, the degree of freedom of "millet" can be considered high.
In this disclosure, when calculating the degrees of freedom of the candidate lemmas, the server may sequentially select all the candidate lemmas obtained by splitting as target candidate lemmas, and then calculate the degrees of freedom for each selected target candidate lemma, so that the degrees of freedom of all the candidate lemmas obtained after splitting may be calculated.
When the server calculates the degree of freedom of the target candidate lemma, the server may first read the neighboring words of the target candidate lemma recorded in the corpus preprocessing stage, at this time, the target candidate lemma may have a plurality of neighboring words, and the server may calculate the information entropy of each neighboring word of the target candidate lemma respectively.
The information entropy of an adjacent character can be obtained by the following formula:

$$I(\omega) = -\sum P \cdot \log(P)$$
in the above formula, I (ω) represents the calculated information entropy. P represents the probability of occurrence of adjacent words of the target candidate lemma in all corpora. The base of the log function in the above formula is not particularly limited in this disclosure, and may be 2 or other values, and those skilled in the art may refer to the description in the prior art.
After the server calculates the information entropies of all the adjacent characters of the target candidate lemma through the above formula, it can add them up and use the sum as the degree of freedom of the target candidate lemma. It follows that the more adjacent characters a target candidate lemma has, the higher its finally calculated degree of freedom, and the more fixed words or fixed phrases can be collocated with it.
The above describes the calculation process of the degrees of freedom of the candidate lemmas, and in this way, the degree of freedom of each candidate lemma can be calculated.
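A sketch of the degree-of-freedom step under the same assumptions (illustrative names; `left_chars`/`right_chars` are the neighbour records from the preprocessing sketch and `char_counts` is an assumed corpus-wide tally of single characters; the text does not specify whether a character seen on both sides contributes once or twice, so this sketch lets each side contribute its own term):

```python
import math

def degree_of_freedom(lemma, left_chars, right_chars,
                      char_counts, total_length):
    """Sum the information entropy terms -P * log2(P) over the lemma's
    recorded left and right adjacent characters, where P is that
    character's corpus-wide occurrence probability (count divided by
    the total corpus length), as described in the text above."""
    dof = 0.0
    for side in (left_chars[lemma], right_chars[lemma]):
        for ch in side:
            p = char_counts[ch] / total_length
            if p > 0:
                dof += -p * math.log2(p)
    return dof
```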
In the present disclosure, after the server calculates the degree of aggregation and the degree of freedom of each split candidate lemma, it may perform a weighted calculation on the two to obtain a weighted sum for each lemma, and then extract new candidate words or candidate phrases from all the candidate lemmas according to the calculated weighted sums.
In an embodiment shown, when the server performs weighting calculation on the aggregation degree and the degree of freedom of each candidate lemma respectively, the weighting calculation may be performed by the following weighting formula:
$$F(\omega) = S(\omega) \cdot \omega_1 + I(\omega) \cdot \omega_2$$

wherein $F(\omega)$ represents the weighted sum; $S(\omega)$ represents the degree of aggregation of the candidate lemma; $\omega_1$ represents the weight ratio preset for the degree of aggregation; $I(\omega)$ represents the degree of freedom of the candidate lemma; and $\omega_2$ represents the weight ratio preset for the degree of freedom. The specific values of $\omega_1$ and $\omega_2$ can be customized or adjusted by the user according to actual requirements.
In the present disclosure, after the server calculates the weighted sum of all candidate lemmas through the above weighting formula, the calculated weighted sum of all candidate lemmas may be sorted according to the magnitude of the value, and then the server may extract m candidate lemmas with the highest weighted sum among all candidate lemmas as new candidate words or new candidate phrases based on the sorting. The value of m may be set by a user according to actual requirements, for example, the top 10 candidate lemmas with the highest weighted sum value may be extracted.
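Putting the two indexes together, the weighting and extraction steps might look like the following sketch; `w1`, `w2` and `m` correspond to the user-set $\omega_1$, $\omega_2$ and m, and the optional `c_of`/`w3` arguments anticipate the correction parameter introduced below. All names are illustrative.

```python
def extract_new_words(lemmas, s_of, i_of, w1=1.0, w2=1.0,
                      c_of=None, w3=0.0, m=10):
    """Score each candidate lemma with F = S*w1 + I*w2 (plus C*w3 when
    a correction function is supplied), sort by score, and return the
    m highest-scoring lemmas as candidate words or phrases."""
    scored = []
    for lemma in lemmas:
        f = s_of(lemma) * w1 + i_of(lemma) * w2
        if c_of is not None:
            f += c_of(lemma) * w3
        scored.append((f, lemma))
    scored.sort(reverse=True)
    return [lemma for _, lemma in scored[:m]]
```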
In the above embodiment, weighting with the two indexes of the degree of aggregation and the degree of freedom can well measure how inseparable a candidate lemma is and how flexibly it collocates. However, the commodity names and user search logs of an e-commerce website contain many special brand names and proper nouns, so relying only on the degree of aggregation and the degree of freedom may miss words that occur relatively frequently but are easy to split apart or have a low degree of freedom, and such words may well be proper nouns of the e-commerce website.
Therefore, in the present disclosure, in order to make the extraction of candidate words or candidate phrases more accurate, the occurrence probability of the candidate lemma may be introduced into the above weighted calculation as a correction parameter in addition to the degree of aggregation and the degree of freedom; this helps avoid missing, during new word extraction based on the weighting result, words that occur relatively frequently but are easy to split apart or have a low degree of freedom.
In another embodiment shown in the present disclosure, when the server performs weighting calculation on the aggregation degree and the degree of freedom of each candidate lemma, the weighting calculation may be performed by the following weighting formula:
$$F(\omega) = S(\omega) \cdot \omega_1 + I(\omega) \cdot \omega_2 + C(\omega) \cdot \omega_3$$

In the above formula, $F(\omega)$ represents the weighted sum; $S(\omega)$ represents the degree of aggregation of the candidate lemma; $\omega_1$ represents the weight ratio preset for the degree of aggregation; $I(\omega)$ represents the degree of freedom of the candidate lemma; $\omega_2$ represents the weight ratio preset for the degree of freedom; $C(\omega)$ represents a correction parameter; and $\omega_3$ represents the weight ratio preset for the correction parameter.
In the present disclosure, when the occurrence probability of the candidate lemma is used as the correction parameter, $C(\omega)$ in the above formula represents the occurrence probability of the candidate lemma, and $\omega_3$ represents the weight ratio set by the user for it.
It should be noted that the weight ratios of the degree of aggregation, the degree of freedom and the occurrence probability of the candidate lemma in the above formula can all be set and adjusted by the user according to actual needs. In practical applications, the total length of the corpus used by the server may also influence the result of the weighted calculation; for example, when the total corpus length is insufficient, the calculated degree of aggregation and degree of freedom of the candidate lemmas may be inaccurate. In this case, using the occurrence probability of the candidate lemma as the correction parameter and appropriately adjusting its weight ratio can correct the finally calculated weighted sum.
In this disclosure, when adjusting the weight ratio of the occurrence probability of the candidate lemma, the server may compare the total corpus length with a preset threshold. If the total corpus length is greater than the preset threshold, the corpus is sufficient and the degree of aggregation and degree of freedom calculated by the server are generally accurate, so the server may appropriately reduce the weight ratio of the occurrence probability of the candidate lemma based on a preset amplitude. Conversely, if the total corpus length is less than the preset threshold, the corpus may be insufficient and the calculated degree of aggregation and degree of freedom are usually inaccurate, so the server can appropriately raise the weight ratio of the occurrence probability of the candidate lemma based on the preset amplitude, which helps offset the inaccuracy caused by the insufficient corpus and corrects the finally calculated weighted sum to the greatest extent.
The preset amplitude may be set according to how far the total corpus length exceeds or falls below the preset threshold: the larger the gap, the larger the adjustment amplitude that may be set; conversely, the smaller the gap, the smaller the amplitude.
For example, the server may divide the amount by which the total corpus length exceeds (or falls below) the threshold into a plurality of levels and set a corresponding adjustment amplitude for each level, with larger excesses mapped to larger amplitudes. When determining the adjustment amplitude, the actual gap between the total corpus length and the preset threshold can be matched against these levels; once a level is matched, the weight ratio of the occurrence probability of the candidate lemma can be adjusted based on the amplitude corresponding to that level.
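One way to realise this levelled adjustment is sketched below; the threshold, level boundaries and amplitudes are invented for illustration and would be tuned against the actual corpus.

```python
def adjust_correction_weight(w3, total_length, threshold=1_000_000,
                             levels=((500_000, 0.10),
                                     (100_000, 0.05),
                                     (0, 0.02))):
    """Raise w3 (the correction parameter's weight) when the corpus is
    shorter than the threshold, lower it when longer; the further the
    total length is from the threshold, the larger the preset
    amplitude applied."""
    gap = abs(total_length - threshold)
    amplitude = next(a for bound, a in levels if gap >= bound)
    if total_length < threshold:
        return w3 + amplitude
    return max(0.0, w3 - amplitude)
```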
After the server calculates the weighted sum of all candidate lemmas through the weighting formula, the server can still rank the weighted sum of all the candidate lemmas obtained through calculation according to the numerical value, and then the server can extract m candidate lemmas with the highest weighted sum of all the candidate lemmas as new candidate words or new candidate phrases based on the ranking, and the specific process is not repeated.
In the disclosure, after the server extracts new candidate words or candidate phrases from all the candidate lemmas, the extracted candidate words or candidate phrases may be output in a manual review interface, so that a reviewer performs manual review.
In the manual review page, each output candidate word or candidate phrase can be set to a "pass" or "reject" review state respectively. When an output candidate word or candidate phrase is approved by the reviewer, it can be set to the "pass" state. In addition, the manual review page may provide a user option for setting an import directory for the candidate words or candidate phrases whose review is complete, and the reviewer can set the import directory by operating this option. For example, for the search server, candidate words or candidate phrases that have passed review may be imported into the word segmentation vocabulary or the spelling suggestions of the search engine, so the user option in the manual review interface may provide two import directories: word segmentation vocabulary and spelling suggestions.
For the candidate words or candidate phrases whose review is complete, the server can periodically read those set to the "pass" state in the manual review interface, read the import directory set by the reviewer, and import the read candidate words or candidate phrases into the corresponding directory. In this way, when a user later searches with keywords containing an imported candidate word or phrase, the server can correctly segment the keywords through the word segmentation vocabulary and output matched search results; and while the user is entering keywords, the server can also present the imported candidate words or phrases to the user as spelling suggestions, improving the user's search experience.
Therefore, through the above method, the server can continuously acquire corpora and accurately extract new candidate words or candidate phrases from them to enrich the word segmentation vocabulary and spelling suggestions of the search engine, thereby continuously optimizing the user's search experience and improving the accuracy of search results.
In the above embodiments, the technical solution of the present disclosure is described by taking an application scenario of an e-commerce website as an example, and in practical application, the technical solution of the present disclosure may also be applied to other scenarios. When a person skilled in the art puts the above technical solution into practice in other scenarios, reference may be made to the descriptions in the above embodiments of the present disclosure, and details are not described in the present disclosure.
In the above embodiment, the degrees of aggregation of a plurality of candidate lemmas are calculated, the degree of aggregation characterizing the probability that a candidate lemma is a fixed word or fixed phrase; the degrees of freedom of the plurality of candidate lemmas are calculated, the degree of freedom characterizing the flexibility with which a candidate lemma collocates with fixed words or fixed phrases, a higher degree-of-freedom value indicating that more fixed words or fixed phrases can be collocated with the candidate lemma; the calculated degree of aggregation and degree of freedom of each candidate lemma are then weighted to obtain a weighted sum, and candidate words or candidate phrases are extracted from the plurality of candidate lemmas based on the calculated weighted sums.
Corresponding to the embodiment of the new word extraction method, the disclosure also provides an embodiment of a new word extraction device.
Fig. 3 is a schematic block diagram illustrating a new word extraction apparatus according to an exemplary embodiment.
As shown in fig. 3, a new word extraction apparatus 30 according to an exemplary embodiment is shown, including: a first calculation module 301, a second calculation module 302, a third calculation module 303 and an extraction module 304; wherein:
the first calculation module 301 is configured to calculate the aggregation degrees of a plurality of candidate lemmas; the degree of aggregation characterizes the probability of the candidate lemma as a fixed word or a fixed phrase;
the second calculation module 302 configured to calculate degrees of freedom of the plurality of candidate lemmas; the degree of freedom characterizes the flexibility of collocating the candidate lemma with fixed words or fixed phrases; the higher the degree of freedom value, the more fixed words or fixed phrases the candidate lemma can be collocated with;
the third calculating module 303 is configured to perform weighted calculation on the degree of aggregation and the degree of freedom of each candidate lemma calculated by the first calculating module and the second calculating module to obtain a weighted sum;
the extracting module 304 is configured to extract candidate words or candidate phrases from the candidate lemmas based on the weighted sum calculated by the third calculating module.
In the above embodiment, the degrees of aggregation of a plurality of candidate lemmas are calculated, the degree of aggregation characterizing the probability that a candidate lemma is a fixed word or fixed phrase; the degrees of freedom of the plurality of candidate lemmas are calculated, the degree of freedom characterizing the flexibility with which a candidate lemma collocates with fixed words or fixed phrases, a higher degree-of-freedom value indicating that more fixed words or fixed phrases can be collocated with the candidate lemma; the calculated degree of aggregation and degree of freedom of each candidate lemma are then weighted to obtain a weighted sum, and candidate words or candidate phrases are extracted from the plurality of candidate lemmas based on the calculated weighted sums.
Referring to fig. 4, fig. 4 is a block diagram of another new word extracting apparatus according to an exemplary embodiment of the present disclosure, and based on the foregoing embodiment shown in fig. 3, the apparatus 30 may further include an obtaining module 305 and a splitting module 306; wherein:
the obtaining module 305 is configured to obtain corpora;
the splitting module 306 is configured to split the morpheme of the corpus acquired by the acquiring module to obtain the multiple candidate morphemes.
The corpus comprises commodity names and user search logs.
Referring to fig. 5, fig. 5 is a block diagram of another new word extracting apparatus according to an exemplary embodiment of the present disclosure, where on the basis of the foregoing embodiment shown in fig. 4, the first calculating module 301 may include a first statistical submodule 301A, a first selected submodule 301B, and a first calculating submodule 301C; wherein:
the first statistical submodule 301A is configured to count the occurrence frequency of the candidate lemmas in all corpora;
the first selecting sub-module 301B is configured to sequentially select the candidate lemmas as target candidate lemmas;
the first calculating submodule 301C is configured to calculate the degree of aggregation of the target candidate lemma selected by the first selected submodule according to an aggregation calculation formula.
Wherein, the calculation formula of the degree of aggregation is as follows:

$$S(A_1 A_2 \ldots A_n) = \min\left\{ \frac{P(A_1 A_2 \ldots A_n)}{P(A_1) \cdot P(A_2 \ldots A_n)},\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 A_2) \cdot P(A_3 \ldots A_n)},\; \ldots,\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 \ldots A_{n-1}) \cdot P(A_n)} \right\}$$

or,

$$S(A_1 A_2 \ldots A_n) = \max\left\{ \frac{P(A_1 A_2 \ldots A_n)}{P(A_1) \cdot P(A_2 \ldots A_n)},\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 A_2) \cdot P(A_3 \ldots A_n)},\; \ldots,\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 \ldots A_{n-1}) \cdot P(A_n)} \right\}$$

$S(A_1 A_2 \ldots A_n)$ represents the degree of aggregation of the candidate lemma $[A_1 A_2 \ldots A_n]$; $P(A_1 A_2 \ldots A_n)$ represents the probability that the candidate lemma $[A_1 A_2 \ldots A_n]$ occurs in all corpora; $P(A_1)$ and $P(A_2 \ldots A_n)$ respectively represent the probabilities that the character $[A_1]$ and the character string $[A_2 \ldots A_n]$ split from the candidate lemma occur in all corpora; the occurrence probability is the ratio of the number of occurrences in all corpora to the total corpus length; the value of $n$ represents the number of characters or letters constituting the candidate lemma.
It should be noted that the structures of the first statistical submodule 301A, the first selected submodule 301B and the first calculating submodule 301C shown in the apparatus embodiment of fig. 5 may also be included in the aforementioned apparatus embodiment of fig. 3, and the present disclosure is not limited thereto.
Referring to fig. 6, fig. 6 is a block diagram of another new word extracting apparatus according to an exemplary embodiment of the present disclosure, which is based on the foregoing embodiment shown in fig. 5, in which the second calculating module 302 may include a second statistics sub-module 302A, a second selected sub-module 302B, a second calculating sub-module 302C, a third calculating sub-module 302D, and an adding sub-module 302E; wherein:
the second statistical submodule 302A is configured to record adjacent words of the plurality of candidate lemmas, and count the occurrence number of the adjacent words of the plurality of candidate lemmas in all corpora;
the second selecting sub-module 302B is configured to sequentially select the candidate lemmas as target candidate lemmas;
the second calculating submodule 302C is configured to calculate the occurrence probability of the adjacent word of the target candidate lemma selected by the second selected submodule in all the corpora;
the third computation submodule 302D is configured to compute an information entropy of the adjacent word based on the occurrence probability computed by the second computation submodule;
the adding sub-module 302E is configured to add the information entropies of all the adjacent words of the target candidate lemma calculated by the third calculation module to obtain the degree of freedom of the target candidate lemma.
It should be noted that, the second calculation module 302 shown in the apparatus embodiment shown in fig. 6 may include structures of a second statistics sub-module 302A, a second selection sub-module 302B, a second calculation sub-module 302C, a third calculation sub-module 302D, and an addition sub-module 302E, which may also be included in the apparatus embodiments of fig. 3 to 4, and the disclosure is not limited thereto.
Referring to fig. 7, fig. 7 is a block diagram of another new word extracting apparatus according to an exemplary embodiment of the present disclosure, where on the basis of the foregoing embodiment shown in fig. 6, the third calculating module 303 may include a weighting sub-module 303A; wherein:
the weighting submodule 303A is configured to perform weighting calculation on the degree of aggregation and the degree of freedom of each candidate lemma calculated by the first calculation module and the second calculation module according to a preset weighting formula to obtain a weighted sum;
the preset weighting formula is as follows:
$$F(\omega) = S(\omega) \cdot \omega_1 + I(\omega) \cdot \omega_2$$

or,

$$F(\omega) = S(\omega) \cdot \omega_1 + I(\omega) \cdot \omega_2 + C(\omega) \cdot \omega_3$$

wherein $F(\omega)$ represents the weighted sum; $S(\omega)$ represents the degree of aggregation of the candidate lemma; $\omega_1$ represents the weight ratio preset for the degree of aggregation; $I(\omega)$ represents the degree of freedom of the candidate lemma; $\omega_2$ represents the weight ratio preset for the degree of freedom; $C(\omega)$ represents a correction parameter; $\omega_3$ represents the weight ratio preset for the correction parameter.
It should be noted that the structure of the weighting sub-module 303A shown in the apparatus embodiment shown in fig. 7 may also be included in the apparatus embodiments of fig. 3 to 5, and the disclosure is not limited thereto.
Referring to fig. 8, fig. 8 is a block diagram of another new word extraction apparatus according to an exemplary embodiment of the present disclosure, where the embodiment is based on the foregoing embodiment shown in fig. 7, and the modification parameter is the probability of occurrence of the candidate lemma in all corpora; the third calculation module 303 may further include a judgment sub-module 303B, an increase sub-module 303C, and a decrease sub-module 303D; wherein:
the judgment submodule 303B is configured to judge whether the total length of the corpus is greater than a preset threshold;
the increasing submodule 303C is configured to increase, based on a preset amplitude, a weight ratio of occurrence probabilities of the candidate lemmas in all the corpuses when the total length of the corpuses is lower than a preset threshold;
the reducing submodule 303D is configured to reduce, when the total length of the corpuses is greater than a preset threshold, a weight ratio of occurrence probabilities of the candidate lemmas in all the corpuses based on a preset amplitude.
It should be noted that, the structures of the judgment sub-module 303B, the increase sub-module 303C and the decrease sub-module 303D shown in the apparatus embodiment shown in fig. 8 may also be included in the apparatus embodiments of fig. 3 to 7, and the disclosure is not limited thereto.
Referring to fig. 9, fig. 9 is a block diagram of another new word extracting apparatus according to an exemplary embodiment of the present disclosure, which is based on the foregoing embodiment shown in fig. 3, in which the extracting module 304 may include a sorting sub-module 304A and an extracting sub-module 304B; wherein:
the sorting submodule 304A is configured to sort the weighted sum calculated by the third calculation module according to the numerical value;
the extracting sub-module 304B is configured to extract m candidate lemmas with the highest weighted sum among the candidate lemmas as candidate words or candidate phrases based on the ranking; wherein the value of m is set by a user.
It should be noted that the structures of the sorting sub-module 304A and the extracting sub-module 304B shown in the apparatus embodiment shown in fig. 9 may also be included in the apparatus embodiments of fig. 4 to 8, and the disclosure is not limited thereto.
Referring to fig. 10, fig. 10 is a block diagram of another new word extracting apparatus shown in the present disclosure according to an exemplary embodiment, in which on the basis of the foregoing embodiment shown in fig. 9, the extracting module 304 may further include an output sub-module 304C and an import sub-module 304D; wherein:
the output sub-module 304C is configured to output the candidate words or candidate phrases extracted by the extraction sub-module on a manual review interface;
the import sub-module 304D is configured to audit the passed candidate words or candidate word groups in the manual audit interface and import spelling suggestions or word segmentation word lists of a search engine.
It should be noted that the structures of the output sub-module 304C and the import sub-module 304D shown in the apparatus embodiment shown in fig. 10 may also be included in the apparatus embodiments of fig. 3 to 8, and the disclosure is not limited thereto.
The implementation process of the functions and actions of each module in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, wherein the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the disclosed solution. One of ordinary skill in the art can understand and implement it without inventive effort.
Correspondingly, the present disclosure further provides a new word extraction device, the device including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
calculating the degree of aggregation of a plurality of candidate lemmas; the degree of aggregation characterizes the probability of the candidate lemma as a fixed word or a fixed phrase;
calculating degrees of freedom of the plurality of candidate lemmas; the degree of freedom characterizes the flexibility of collocating the candidate lemma with fixed words or fixed phrases; the higher the degree of freedom value, the more fixed words or fixed phrases the candidate lemma can be collocated with;
respectively carrying out weighted calculation on the calculated aggregation degree and the calculated freedom degree of each candidate lemma in the plurality of candidate lemmas to obtain weighted sums;
and extracting candidate words or candidate phrases from the candidate lemmas based on the calculated weighted sum.
Accordingly, the present disclosure also provides a server comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs configured to be executed by the one or more processors include instructions for:
calculating the degree of aggregation of a plurality of candidate lemmas; the degree of aggregation characterizes the probability of the candidate lemma as a fixed word or a fixed phrase;
calculating degrees of freedom of the plurality of candidate lemmas; the degree of freedom characterizes the flexibility of collocating the candidate lemma with fixed words or fixed phrases; the higher the degree of freedom value, the more fixed words or fixed phrases the candidate lemma can be collocated with;
respectively carrying out weighted calculation on the calculated aggregation degree and the calculated freedom degree of each candidate lemma in the plurality of candidate lemmas to obtain weighted sums;
and extracting candidate words or candidate phrases from the candidate lemmas based on the calculated weighted sum.
Fig. 11 is a block diagram illustrating an apparatus 1100 for new word extraction according to an example embodiment. For example, the apparatus 1100 may be provided as a server.
Referring to fig. 11, the apparatus 1100 includes a processing component 1122 that further includes one or more processors, and memory resources represented by memory 1132 for storing instructions, such as application programs, executable by the processing component 1122. The application programs stored in memory 1132 may include one or more modules each corresponding to a set of instructions. Further, the processing component 1122 is configured to execute instructions to perform the new word extraction method described above.
The apparatus 1100 may also include a power component 1126 configured to perform power management of the apparatus 1100, a wired or wireless network interface 1150 configured to connect the apparatus 1100 to a network, and an input/output (I/O) interface 1158. The apparatus 1100 may operate based on an operating system stored in memory 1132, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (19)

1. A method for extracting new words, the method comprising:
calculating the degree of aggregation of a plurality of candidate lemmas; the degree of aggregation characterizes the probability of the candidate lemma as a fixed word or a fixed phrase;
calculating degrees of freedom of the plurality of candidate lemmas; the degree of freedom characterizes the flexibility of collocating the candidate lemma with fixed words or fixed phrases; the higher the degree of freedom value, the more fixed words or fixed phrases the candidate lemma can be collocated with;
respectively carrying out weighted calculation on the calculated aggregation degree and the calculated freedom degree of each candidate lemma in the plurality of candidate lemmas to obtain weighted sums;
and extracting candidate words or candidate phrases from the candidate lemmas based on the calculated weighted sum.
2. The method of claim 1, wherein the method further comprises:
obtaining a corpus;
and carrying out lexical element splitting on the obtained corpus to obtain the candidate lexical elements.
3. The method of claim 2, wherein the corpus comprises names of goods and user search logs.
4. The method of claim 2, wherein said calculating the degree of aggregation of the plurality of candidate lemmas comprises:
counting the occurrence times of the candidate lemmas in all the linguistic data;
sequentially selecting the candidate word elements as target candidate word elements;
and calculating the degree of cohesion of the selected target candidate lemmas according to a degree of cohesion calculation formula.
Wherein, the calculation formula of the degree of aggregation is as follows:

$$S(A_1 A_2 \ldots A_n) = \min\left\{ \frac{P(A_1 A_2 \ldots A_n)}{P(A_1) \cdot P(A_2 \ldots A_n)},\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 A_2) \cdot P(A_3 \ldots A_n)},\; \ldots,\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 \ldots A_{n-1}) \cdot P(A_n)} \right\}$$

or,

$$S(A_1 A_2 \ldots A_n) = \max\left\{ \frac{P(A_1 A_2 \ldots A_n)}{P(A_1) \cdot P(A_2 \ldots A_n)},\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 A_2) \cdot P(A_3 \ldots A_n)},\; \ldots,\; \frac{P(A_1 A_2 \ldots A_n)}{P(A_1 \ldots A_{n-1}) \cdot P(A_n)} \right\}$$

$S(A_1 A_2 \ldots A_n)$ represents the degree of aggregation of the candidate lemma $[A_1 A_2 \ldots A_n]$; $P(A_1 A_2 \ldots A_n)$ represents the probability that the candidate lemma $[A_1 A_2 \ldots A_n]$ occurs in all corpora; $P(A_1)$ and $P(A_2 \ldots A_n)$ respectively represent the probabilities that the character $[A_1]$ and the character string $[A_2 \ldots A_n]$ split from the candidate lemma occur in all corpora; the occurrence probability is the ratio of the number of occurrences in all corpora to the total corpus length; the value of $n$ represents the number of characters or letters constituting the candidate lemma.
5. The method of claim 4, wherein said computing degrees of freedom for a plurality of candidate tokens comprises:
recording adjacent characters of the candidate lemmas, and counting the occurrence times of the adjacent characters of the candidate lemmas in all linguistic data;
sequentially selecting the candidate word elements as target candidate word elements;
calculating the occurrence probability of the adjacent characters of the selected target candidate lemmas in all linguistic data;
calculating information entropy of the adjacent word based on the calculated occurrence probability;
and adding the calculated information entropies of all the adjacent words of the target candidate lemma to obtain the degree of freedom of the target candidate lemma.
6. The method of claim 5, wherein the weighted computing of the calculated degree of aggregation and the degree of freedom for each of the plurality of candidate tokens, respectively, to obtain a weighted sum comprises:
respectively carrying out weighted calculation on the aggregation degree and the freedom degree of each candidate lemma in the candidate lemma according to a preset weighting formula to obtain a weighted sum;
the preset weighting formula is as follows:
$$F(\omega) = S(\omega) \cdot \omega_1 + I(\omega) \cdot \omega_2$$

or,

$$F(\omega) = S(\omega) \cdot \omega_1 + I(\omega) \cdot \omega_2 + C(\omega) \cdot \omega_3$$

wherein $F(\omega)$ represents the weighted sum; $S(\omega)$ represents the degree of aggregation of the candidate lemma; $\omega_1$ represents the weight ratio preset for the degree of aggregation; $I(\omega)$ represents the degree of freedom of the candidate lemma; $\omega_2$ represents the weight ratio preset for the degree of freedom; $C(\omega)$ represents a correction parameter; $\omega_3$ represents the weight ratio preset for the correction parameter.
7. The method of claim 6, wherein the modification parameter is a probability of occurrence of the candidate lemma in all corpora;
the method further comprises the following steps:
judging whether the total length of the corpus is greater than a preset threshold value or not;
when the total length of the linguistic data is lower than a preset threshold value, the weight proportion of the occurrence probability of the candidate lemmas in all the linguistic data is improved based on a preset amplitude;
and when the total length of the linguistic data is greater than a preset threshold value, reducing the weight proportion of the occurrence probability of the candidate lemmas in all the linguistic data based on a preset amplitude.
8. The method of claim 6, wherein said extracting candidate words or candidate phrases from said plurality of candidate tokens based on said calculated weighted sum comprises:
sorting the weighted sums obtained by calculation according to the numerical value;
extracting m candidate lemmas with the highest weighted sum in the candidate lemmas as candidate words or candidate phrases based on the ordering;
wherein the value of m is set by a user.
9. The method of claim 8, wherein the method further comprises:
outputting the extracted candidate words or candidate phrases on a manual review interface;
and importing the candidate words or candidate phrases which are checked in the manual checking interface into spelling suggestions or word segmentation word lists of a search engine.
10. A new word extraction apparatus, characterized in that the apparatus comprises:
a first calculation module configured to calculate a degree of aggregation of a plurality of candidate lemmas; the degree of aggregation characterizes the probability of the candidate lemma as a fixed word or a fixed phrase;
a second calculation module configured to calculate degrees of freedom of the plurality of candidate lemmas; the degree of freedom characterizes the flexibility of collocating the candidate lemma with fixed words or fixed phrases; the higher the degree of freedom value, the more fixed words or fixed phrases the candidate lemma can be collocated with;
a third calculation module configured to perform weighted calculation on the degree of aggregation and the degree of freedom of each candidate lemma calculated by the first calculation module and the second calculation module to obtain a weighted sum;
an extracting module configured to extract candidate words or candidate phrases from the plurality of candidate lemmas based on the weighted sum calculated by the third calculating module.
11. The apparatus of claim 10, wherein the apparatus further comprises:
an acquisition module configured to acquire a corpus;
and the splitting module is configured to split the corpus acquired by the acquiring module into lemmas to obtain the plurality of candidate lemmas.
12. The apparatus of claim 11, wherein the corpus comprises names of goods and user search logs.
13. The apparatus of claim 11, wherein the first computing module comprises:
the first statistic submodule is configured to count the occurrence times of the candidate lemmas in all the corpora;
a first selection submodule configured to sequentially select the plurality of candidate lemmas as target candidate lemmas;
a first calculation submodule configured to calculate, according to the degree of aggregation calculation formula, the degree of aggregation of the target candidate lemma selected by the first selection submodule.
wherein the calculation formula of the degree of aggregation is as follows:
S(A₁A₂…Aₙ) = min{ P(A₁A₂…Aₙ) / (P(A₁)·P(A₂…Aₙ)), P(A₁A₂…Aₙ) / (P(A₁A₂)·P(A₃…Aₙ)), …, P(A₁A₂…Aₙ) / (P(A₁…Aₙ₋₁)·P(Aₙ)) }
wherein S(A₁A₂…Aₙ) represents the degree of aggregation of the candidate lemma [A₁A₂…Aₙ]; P(A₁A₂…Aₙ) represents the occurrence probability of the candidate lemma [A₁A₂…Aₙ] in all the corpora; P(A₁) and P(A₂…Aₙ) respectively represent the occurrence probabilities, in all the corpora, of the character [A₁] and the character string [A₂…Aₙ] split from the candidate lemma [A₁A₂…Aₙ]; the occurrence probability is the ratio of the number of occurrences in all the corpora to the total length of the corpora; and the value of n represents the number of characters or letters constituting the candidate lemma.
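A sketch of the degree-of-aggregation formula: the minimum, over every binary split of the lemma, of the ratio between the lemma's occurrence probability and the product of the split parts' probabilities. The prob callable and the handling of unseen substrings are assumptions of the sketch.

```python
def degree_of_aggregation(lemma, prob):
    # prob(s) should return the occurrence probability of string s,
    # i.e. its occurrence count divided by the total corpus length.
    p_whole = prob(lemma)
    ratios = []
    for k in range(1, len(lemma)):
        prefix, suffix = lemma[:k], lemma[k:]
        denom = prob(prefix) * prob(suffix)
        # Unseen substring: treat the split as maximally cohesive
        # (an assumed convention; the claim does not address it).
        ratios.append(p_whole / denom if denom > 0 else float("inf"))
    return min(ratios) if ratios else 0.0
```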
14. The apparatus of claim 13, wherein the second computing module comprises:
the second statistic submodule is configured to record the characters adjacent to the candidate lemmas and count the occurrence times of those adjacent characters in all the corpora;
a second selection submodule configured to sequentially select the plurality of candidate lemmas as target candidate lemmas;
a second calculating submodule configured to calculate the occurrence probabilities, in all the corpora, of the characters adjacent to the target candidate lemma selected by the second selection submodule;
a third calculation submodule configured to calculate the information entropy of the adjacent characters based on the occurrence probability calculated by the second calculating submodule;
an adding submodule configured to add the information entropies of all the characters adjacent to the target candidate lemma calculated by the third calculation submodule to obtain the degree of freedom of the target candidate lemma.
15. The apparatus of claim 14, wherein the third computing module comprises:
the weighting submodule is configured to perform, according to a preset weighting formula, a weighted calculation on the degree of aggregation and the degree of freedom of each candidate lemma among the plurality of candidate lemmas calculated by the first calculation module and the second calculation module to obtain the weighted sum;
the preset weighting formula is as follows:
F(ω) = S(ω)·ω₁ + I(ω)·ω₂
or,
F(ω) = S(ω)·ω₁ + I(ω)·ω₂ + C(ω)·ω₃
wherein F(ω) represents the weighted sum; S(ω) represents the degree of aggregation of the candidate lemma; ω₁ represents the weight proportion preset for the degree of aggregation; I(ω) represents the degree of freedom of the candidate lemma; ω₂ represents the weight proportion preset for the degree of freedom; C(ω) represents a correction parameter; and ω₃ represents the weight proportion preset for the correction parameter.
16. The apparatus of claim 15, wherein the correction parameter is the occurrence probability of the candidate lemma in all the corpora;
the third computing module further comprises:
the judging submodule is configured to judge whether the total length of the corpora is greater than a preset threshold;
the increasing submodule is configured to increase, by a preset amplitude, the weight proportion of the occurrence probability of the candidate lemma in all the corpora when the total length of the corpora is lower than the preset threshold;
and the reducing submodule is configured to reduce, by a preset amplitude, the weight proportion of the occurrence probability of the candidate lemma in all the corpora when the total length of the corpora is greater than the preset threshold.
17. The apparatus of claim 15, wherein the extraction module comprises:
the sorting submodule is configured to sort, by numerical value, the weighted sums calculated by the third calculation module;
an extraction submodule configured to extract, based on the sorting, the m candidate lemmas with the highest weighted sums among the plurality of candidate lemmas as candidate words or candidate phrases; wherein the value of m is set by a user.
18. The apparatus of claim 17, wherein the extraction module further comprises:
the output submodule is configured to output the candidate words or candidate phrases extracted by the extraction submodule on a manual review interface;
and the import submodule is configured to import the candidate words or candidate phrases approved in the manual review interface into a spelling suggestion list or a word segmentation word list of a search engine.
19. A new word extraction device, characterized by comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
calculating a degree of aggregation of a plurality of candidate lemmas; the degree of aggregation characterizes the probability of the candidate lemma being a fixed word or a fixed phrase;
calculating degrees of freedom of the plurality of candidate lemmas; the degree of freedom characterizes the flexibility of matching the candidate lemma with fixed words or fixed phrases; a higher degree of freedom value indicates that the candidate lemma can be matched with more fixed words or fixed phrases;
performing a weighted calculation, respectively, on the calculated degree of aggregation and degree of freedom of each candidate lemma among the plurality of candidate lemmas to obtain weighted sums;
and extracting candidate words or candidate phrases from the plurality of candidate lemmas based on the calculated weighted sums.
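Tying the pieces together, a minimal end-to-end sketch of the claimed pipeline, reusing the helper functions sketched above. The substring-based candidate generation, the weights, and the value of m are illustrative assumptions, not details from the patent.

```python
from collections import Counter

def candidate_lemmas(corpus, max_len=4):
    # One simple reading of "splitting the corpus into lemmas":
    # every substring of 2..max_len characters is a candidate.
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(corpus) - n + 1):
            counts[corpus[i:i + n]] += 1
    return counts

def extract_new_words(corpus, m=20, w1=1.0, w2=1.0):
    # Occurrence probability = occurrence count / total corpus length,
    # matching the definition given with the aggregation formula.
    total = len(corpus)
    prob = lambda s: corpus.count(s) / total

    scores = {}
    for lemma in candidate_lemmas(corpus):
        s = degree_of_aggregation(lemma, prob)
        i = degree_of_freedom(lemma, corpus)
        scores[lemma] = weighted_sum(s, i, w1, w2)
    return extract_top_m(scores, m)
```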
CN201510729084.7A 2015-10-30 2015-10-30 New words extraction method and apparatus Active CN105260362B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510729084.7A CN105260362B (en) 2015-10-30 2015-10-30 New words extraction method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510729084.7A CN105260362B (en) 2015-10-30 2015-10-30 New words extraction method and apparatus

Publications (2)

Publication Number Publication Date
CN105260362A 2016-01-20
CN105260362B CN105260362B (en) 2019-02-12

Family

ID=55100058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510729084.7A Active CN105260362B (en) 2015-10-30 2015-10-30 New words extraction method and apparatus

Country Status (1)

Country Link
CN (1) CN105260362B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102930055A (en) * 2012-11-18 2013-02-13 浙江大学 New network word discovery method in combination with internal polymerization degree and external discrete information entropy
CN103020022A (en) * 2012-11-20 2013-04-03 北京航空航天大学 Chinese unregistered word recognition system and method based on improvement information entropy characteristics
CN103049501A (en) * 2012-12-11 2013-04-17 上海大学 Chinese domain term recognition method based on mutual information and conditional random field model
CN103092966A (en) * 2013-01-23 2013-05-08 盘古文化传播有限公司 Vocabulary mining method and device

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017185674A1 (en) * 2016-04-29 2017-11-02 乐视控股(北京)有限公司 Method and apparatus for discovering new word
CN105955950A (en) * 2016-04-29 2016-09-21 乐视控股(北京)有限公司 New word discovery method and device
CN106126495A (en) * 2016-06-16 2016-11-16 北京捷通华声科技股份有限公司 A kind of based on large-scale corpus prompter method and apparatus
CN106126495B (en) * 2016-06-16 2019-03-12 北京捷通华声科技股份有限公司 One kind being based on large-scale corpus prompter method and apparatus
CN110245343A (en) * 2018-03-07 2019-09-17 优酷网络技术(北京)有限公司 Barrage analysis method and device
CN109299230A (en) * 2018-09-06 2019-02-01 华泰证券股份有限公司 A kind of customer service public sentiment hot word data digging system and method
CN109492224A (en) * 2018-11-07 2019-03-19 北京金山数字娱乐科技有限公司 A kind of method and device of vocabulary building
CN109492224B (en) * 2018-11-07 2024-05-03 北京金山数字娱乐科技有限公司 Vocabulary construction method and device
CN109858010A (en) * 2018-11-26 2019-06-07 平安科技(深圳)有限公司 Field new word identification method, device, computer equipment and storage medium
CN109858023A (en) * 2019-01-04 2019-06-07 北京车慧科技有限公司 A kind of sentence error correction device
CN110110322A (en) * 2019-03-29 2019-08-09 泰康保险集团股份有限公司 Network new word discovery method, apparatus, electronic equipment and storage medium
CN112182448A (en) * 2019-07-05 2021-01-05 百度在线网络技术(北京)有限公司 Page information processing method, device and equipment
CN111626054A (en) * 2020-05-21 2020-09-04 北京明亿科技有限公司 New illegal behavior descriptor identification method and device, electronic equipment and storage medium
CN111626054B (en) * 2020-05-21 2023-12-19 北京明亿科技有限公司 Novel illegal action descriptor recognition method and device, electronic equipment and storage medium
CN111898010A (en) * 2020-07-10 2020-11-06 时趣互动(北京)科技有限公司 New keyword mining method and device and electronic equipment
CN111898010B (en) * 2020-07-10 2024-09-13 时趣互动(北京)科技有限公司 New keyword mining method and device and electronic equipment
CN111831832B (en) * 2020-07-27 2022-07-01 北京世纪好未来教育科技有限公司 Word list construction method, electronic device and computer readable medium
CN111831832A (en) * 2020-07-27 2020-10-27 北京世纪好未来教育科技有限公司 Word list construction method, electronic device and computer readable medium
CN112560448A (en) * 2021-02-20 2021-03-26 京华信息科技股份有限公司 New word extraction method and device
CN112560448B (en) * 2021-02-20 2021-06-22 京华信息科技股份有限公司 New word extraction method and device
CN113449082A (en) * 2021-07-16 2021-09-28 上海明略人工智能(集团)有限公司 New word discovery method, system, electronic device and medium
CN115034211A (en) * 2022-05-19 2022-09-09 一点灵犀信息技术(广州)有限公司 Unknown word discovery method and device, electronic equipment and storage medium
CN115146632A (en) * 2022-06-23 2022-10-04 淮阴工学院 New word recognition-based chemical field word segmentation method

Also Published As

Publication number Publication date
CN105260362B (en) 2019-02-12

Similar Documents

Publication Publication Date Title
CN105260362B (en) New words extraction method and apparatus
CN106649818B (en) Application search intention identification method and device, application search method and server
US20210056571A1 (en) Determining of summary of user-generated content and recommendation of user-generated content
JP3882048B2 (en) Question answering system and question answering processing method
CN109597986A (en) Localization method, device, equipment and the storage medium of abnormal problem
US20180052823A1 (en) Hybrid Classifier for Assigning Natural Language Processing (NLP) Inputs to Domains in Real-Time
CN107885717B (en) Keyword extraction method and device
CN102279890A (en) Sentiment word extracting and collecting method based on micro blog
CN104484380A (en) Personalized search method and personalized search device
CN103761254A (en) Method for matching and recommending service themes in various fields
US10902197B1 (en) Vocabulary determination and vocabulary-based content recommendations
KR101541306B1 (en) Computer enabled method of important keyword extraction, server performing the same and storage media storing the same
CN105653547A (en) Method and device for extracting keywords of text
CN114818729A (en) Method, device and medium for training semantic recognition model and searching sentence
Rathan et al. Every post matters: a survey on applications of sentiment analysis in social media
Gambini et al. Tweets2Stance: users stance detection exploiting zero-shot learning algorithms on tweets
CN111160699A (en) Expert recommendation method and system
CN115495636A (en) Webpage searching method, device and storage medium
CN112541069A (en) Text matching method, system, terminal and storage medium combined with keywords
US20140278375A1 (en) Methods and system for calculating affect scores in one or more documents
CN109783175B (en) Application icon management method and device, readable storage medium and terminal equipment
CN115169368B (en) Machine reading understanding method and device based on multiple documents
CN108711073B (en) User analysis method, device and terminal
CN114416977A (en) Text difficulty grading evaluation method and device, equipment and storage medium
CN115130455A (en) Article processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant