CN111831832B - Word list construction method, electronic device and computer readable medium - Google Patents


Info

Publication number
CN111831832B
CN111831832B
Authority
CN
China
Prior art keywords
entries
entry
probability
string
combination
Prior art date
Legal status
Active
Application number
CN202010732672.7A
Other languages
Chinese (zh)
Other versions
CN111831832A (en
Inventor
王桑
李成飞
杨嵩
Current Assignee
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd filed Critical Beijing Century TAL Education Technology Co Ltd
Priority to CN202010732672.7A priority Critical patent/CN111831832B/en
Publication of CN111831832A publication Critical patent/CN111831832A/en
Application granted granted Critical
Publication of CN111831832B publication Critical patent/CN111831832B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses a word list construction method, which comprises the following steps: screening a plurality of first entries based on word string aggregation degree to obtain a plurality of second entries; obtaining a total combination probability for each second entry from the word string combination probability and the pronunciation combination probability of the screened second entries; and constructing a target word list based on the total combination probability. Because both the word-formation combination ability of the entries and the combination ability of their pronunciations are considered, the constructed target word list is more accurate and is constructed more efficiently.

Description

Word list construction method, electronic device and computer readable medium
Technical Field
The embodiments of the invention relate to the technical field of text processing, and in particular to a word list construction method, an electronic device and a computer readable medium.
Background
Vocabulary construction is the process of obtaining words from existing text corpora and other available resources in an automated or semi-automated manner. Vocabulary construction must therefore be based on a text corpus, in which a character or a word usually serves as the basic semantic unit. Existing text corpora fall into two classes. One class has natural separation: for example, spaces separate the words of languages such as English and French, which is convenient for vocabulary construction. The other class, such as Chinese, Japanese and Korean, has no natural separation, and constructing a vocabulary for such corpora is a challenging task.
For text corpora without natural separation, current word list construction methods are mainly statistics-based: through various statistical strategies, the most relevant word string combinations are found in the text corpus, and the word list is constructed from the statistical characteristics of those word string combinations in the corpus. However, with methods that build a word list purely from statistical characteristics of the corpus, some entries appear stably in the corpus yet do not conform to lexical rules, which lowers the accuracy of the constructed word list.
Disclosure of Invention
The present invention provides a vocabulary construction scheme to at least partially address the above-mentioned problems.
According to a first aspect of the embodiments of the present invention, a method for constructing a vocabulary is provided, the method including: acquiring a plurality of first entries meeting a preset rule from a vocabulary corpus to be constructed; calculating the word string aggregation degree corresponding to each of the first entries, and using the first entries whose word string aggregation degree is greater than a first preset threshold as second entries, where the word string aggregation degree represents the degree to which an entry conforms to grammar rules; obtaining a total combination probability corresponding to each second entry based on the string combination probability and the pronunciation combination probability corresponding to each of the second entries; and determining, according to the total combination probability corresponding to each second entry, the second entries whose total combination probability is greater than a second preset threshold, and constructing a target vocabulary from those second entries.
According to a second aspect of embodiments of the present invention, there is provided an electronic apparatus, the apparatus including: one or more processors; a computer readable medium configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the vocabulary construction method according to the first aspect.
According to a third aspect of embodiments of the present invention, there is provided a computer-readable medium on which a computer program is stored, which when executed by a processor, implements the vocabulary construction method according to the first aspect.
According to the scheme provided by the embodiment of the invention: a plurality of first entries meeting a preset rule are acquired from the vocabulary corpus to be constructed; the word string aggregation degree corresponding to each first entry is calculated, and the first entries whose word string aggregation degree is greater than a first preset threshold are used as second entries, where the word string aggregation degree represents the degree to which an entry conforms to grammar rules; a total combination probability corresponding to each second entry is obtained based on the string combination probability and the pronunciation combination probability of each second entry; and the second entries whose total combination probability is greater than a second preset threshold are determined, and a target vocabulary is constructed from them. In this scheme, the first entries are screened based on the word string aggregation degree, the total combination probability of each remaining second entry is then obtained from its word string combination probability and pronunciation combination probability, and the target vocabulary is constructed based on the total combination probability. Both the word-formation combination ability of the entries and the combination ability of their pronunciations are considered, so the constructed target vocabulary is more accurate and the construction is more efficient.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments thereof, made with reference to the following drawings:
FIG. 1 is a flowchart illustrating steps of a vocabulary construction method according to a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating another step of a vocabulary construction method according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a further step of a vocabulary construction method according to the first embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to a second embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and not restrictive of it. It should also be noted that, for convenience of description, only the portions related to the invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
Example one
Referring to fig. 1, a flowchart illustrating steps of a vocabulary constructing method according to a first embodiment of the present invention is shown.
The word list construction method of the embodiment comprises the following steps:
step 101, obtaining a plurality of first entries meeting preset rules from a vocabulary corpus to be constructed.
In this embodiment, the vocabulary corpus to be constructed may include positive samples (sentences or phrases conforming to normal word order) or negative samples (sentences or phrases not conforming to normal word order); this embodiment is not limited in this respect.
For example, the vocabulary corpus to be constructed may consist of the following text segments (a Chinese corpus, rendered here in English):
"Hello everyone, my name is Liu Zixuan."
"Don't look left and right in class."
"Pay attention to the focus of this lesson."
It is understood that each line may be composed of one sentence or a plurality of sentences, or may be composed of one phrase or a plurality of phrases, and the embodiment is not limited.
Alternatively, as shown in fig. 2, step 101 may include the following steps:
step 1011: and acquiring a plurality of initial entries with different lengths from the vocabulary corpus to be constructed through a sliding window, wherein the size of the window is at least one character.
The window size may be set according to actual requirements and is not limited in this embodiment. Preferably, window sizes of 2, 3 and 4 characters are used: sliding each window over the vocabulary corpus to be constructed yields a plurality of initial entries of length 2 (window of 2 characters), length 3 (window of 3 characters) and length 4 (window of 4 characters), which may be called binary, ternary and quaternary word strings respectively.
For example, take the second text above, "上课不要左顾右盼" ("Don't look left and right in class"). If the window size is set to 2 characters, the sliding window yields binary word strings such as "上课", "不要", "要左" and "左顾"; if the window size is set to 3 characters, it yields ternary word strings such as "上课不", "不要左", "要左顾" and "顾右盼"; if the window size is set to 4 characters, it yields quaternary word strings such as "上课不要", "不要左顾" and "左顾右盼". It can be understood that windows of different sizes yield word strings of different lengths, which are not listed one by one here.
Optionally, the obtained multiple initial entries with different lengths may be constructed as an initial vocabulary, that is, the initial entries are arranged in a table form according to a certain rule.
In an embodiment, the plurality of initial entries of different lengths in the initial vocabulary may be directly used as the plurality of first entries.
In this embodiment, the vocabulary corpus to be constructed is divided into binary, ternary and quaternary word strings. This basically conforms to the word-formation and language-expression rules of language families without natural separation, such as Chinese, Japanese and Korean, so word strings of other lengths need not be acquired. On the basis of ensuring the accuracy of the constructed vocabulary, this improves the efficiency of entry acquisition.
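For illustration, a minimal Python sketch of this sliding-window extraction follows; it assumes the corpus is given as an iterable of cleaned text lines, and the function and variable names are ours rather than the patent's:

```python
def extract_initial_entries(corpus_lines, window_sizes=(2, 3, 4)):
    """Slide fixed-size character windows over each line of the corpus.

    Windows of 2, 3 and 4 characters yield the binary, ternary and
    quaternary word strings described above. Every occurrence is kept,
    so the result can later be counted for word frequencies.
    """
    entries = []
    for line in corpus_lines:
        for k in window_sizes:
            for i in range(len(line) - k + 1):
                entries.append(line[i:i + k])
    return entries


# Usage: extract_initial_entries(["上课不要左顾右盼"]) yields "上课", "课不", ...
```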
Step 1012: and counting the occurrence frequency of the plurality of initial entries with different lengths in the vocabulary corpus to be constructed, and acquiring the plurality of initial entries with the occurrence frequency larger than a third preset threshold value as a plurality of first entries.
In another manner of this embodiment, the frequency of occurrence, i.e. the word frequency, of each initial entry in the vocabulary corpus to be constructed may be counted. Because some short or long strings are low-frequency words, the initial entries whose word frequency is not greater than the third preset threshold can be removed, and the initial entries whose word frequency is greater than the third preset threshold are retained as the plurality of first entries; as above, these may also be constructed into a first vocabulary to be screened. The third preset threshold may be set according to the actual situation and is preferably a small value; for example, it may be set to 3.
In this embodiment, the word frequency is used as a screening condition, and only the initial entries with word frequency greater than the third preset threshold are retained, which reduces the amount of calculation in the subsequent steps.
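Continuing the sketch above, the word-frequency screening of step 1012 might look as follows; the threshold value 3 follows the example in the text, and all names are illustrative:

```python
from collections import Counter

def filter_by_frequency(initial_entries, threshold=3):
    # Count each initial entry's occurrences across the corpus, then keep
    # only those whose word frequency exceeds the third preset threshold.
    counts = Counter(initial_entries)
    return {entry: c for entry, c in counts.items() if c > threshold}
```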
Step 102, calculating and obtaining the word string aggregation degrees corresponding to the first entries, and using the first entries with the word string aggregation degrees larger than a first preset threshold value in the first entries as second entries, wherein the word string aggregation degrees represent the degrees of the entries meeting the grammar rules.
In this embodiment, grammar rules are the conventions people follow when speaking or writing, and they exist objectively. Grammar rules indicate how language units combine with one another, including the rules by which morphemes combine into words and the rules by which words combine into sentences. The grammar rules in this embodiment mainly refer to the former, also called lexical rules.
Specifically, in this embodiment, the word string aggregation degree corresponding to each first entry may be calculated, and the first entries whose word string aggregation degree is greater than a first preset threshold are used as the second entries. Optionally, the second entries may be constructed into a second vocabulary to be screened. The first preset threshold may be set appropriately by a person skilled in the art according to the actual situation; for example, it may be set to 0.1.
In this embodiment, the word string aggregation degree is used as a secondary screening condition, so that each obtained second entry has a high degree of combinability and entries with a low degree of combinability are removed, which makes the subsequent calculation simpler and faster.
Optionally, the word string aggregation degree corresponding to each of the plurality of first terms may be obtained by calculating:
calculating to obtain a first probability of each entry in the first entries appearing in the vocabulary corpus to be constructed and a second probability of each character in each entry appearing in the vocabulary corpus to be constructed; and obtaining the word string aggregation degree corresponding to each of the plurality of first terms based on the first probability and the second probability.
In this embodiment, based on the first probability of each entry appearing in the to-be-constructed vocabulary corpus and the second probability of each character in each entry appearing in the to-be-constructed vocabulary corpus, the obtained word string aggregation level corresponding to each first entry is more accurate.
Specifically, in this embodiment, the word string aggregation degree corresponding to each of the first entries may be obtained based on the ratio of the first probability of that entry to the product of its second probabilities.
Further, the formula

$$I(S_1) = \frac{p(c_1, c_2, \ldots, c_n)}{p(c_1)\, p(c_2) \cdots p(c_n)}$$

may be used to calculate the word string aggregation degree corresponding to each of the first entries. Here I(S_1) denotes the word string aggregation degree of the entry S_1 = c_1, c_2, ..., c_n; p(c_1), p(c_2), ..., p(c_n) denote the probabilities that the characters c_1, c_2, ..., c_n of S_1 respectively appear in the vocabulary corpus to be constructed (for example, if the character "好" occurs twice in a seven-character corpus, its probability is 2/7); p(c_1, c_2, ..., c_n) denotes the probability that the entry S_1 = c_1, c_2, ..., c_n appears in the vocabulary corpus to be constructed; and n is a positive integer denoting the number of characters in S_1. For example, to calculate the word string aggregation degree of the entry "左顾右盼", it is necessary to calculate the probability of each of its four characters appearing in the vocabulary corpus to be constructed, corresponding to p(c_1), p(c_2), ..., p(c_n) in the formula, and the probability of the whole entry "左顾右盼" appearing in the corpus, corresponding to p(c_1, c_2, ..., c_n). It should be noted that the above formula is merely an example of an optional implementation; the word string aggregation degree of each first entry may also be calculated in other manners, which this embodiment does not particularly limit.
In natural language processing, mutual information is generally used to judge the possibility of a collocation relationship between two language units: the larger the mutual information, the more likely a collocation relationship exists between them, i.e., in vocabulary construction, the more likely the two word strings combine into one word. However, mutual information only considers the collocation possibility between two language units and cannot measure the collocation of more than two. This embodiment therefore provides a way of calculating the word string aggregation degree with each character of the word string as an independent language unit, so that the collocation possibility (the degree of compliance with grammar rules) among multiple language units can be measured.
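As an illustration of the formula above, a sketch of the aggregation-degree computation; `char_prob` and `entry_prob` are assumed to map characters and entries to their corpus-relative frequencies, estimated beforehand:

```python
def word_string_aggregation(entry, char_prob, entry_prob):
    """I(S) = p(c1,...,cn) / (p(c1) * p(c2) * ... * p(cn)).

    `char_prob` maps each character to its probability of occurrence in
    the corpus; `entry_prob` maps each entry to its probability.
    """
    denominator = 1.0
    for ch in entry:
        denominator *= char_prob[ch]
    return entry_prob[entry] / denominator


# Entries whose aggregation degree exceeds the first preset threshold
# (e.g. 0.1) are kept as second entries.
```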
Step 103, obtaining a total combination probability corresponding to each second entry based on the string combination probability and the pronunciation combination probability corresponding to each second entry in the plurality of second entries.
In this embodiment, the string combination probability may represent a combination ability of each second term, and the pronunciation combination probability may represent a combination ability of pronunciation of each second term.
Specifically, as shown in fig. 3, the method may include the following steps:
step 1031: and calculating to obtain the string combination probability corresponding to each entry in the second entries.
Step 1032: and calculating to obtain the pronunciation combination probability corresponding to each entry in the second entries.
Step 1033: and respectively carrying out weighted summation on the string combination probability and the pronunciation combination probability corresponding to each second entry in the plurality of second entries to obtain the total combination probability corresponding to each second entry.
In this embodiment, the calculation of the string combination probability and the pronunciation combination probability has no necessary order, and the pronunciation combination probability may be calculated first, or the string combination probability may be calculated first, or both probabilities may be calculated in parallel, which is not limited in this embodiment.
Because the word frequency and the word string aggregation degree are both computed from the vocabulary corpus to be constructed, the screened entries still include many that appear stably in the corpus but do not conform to lexical or semantic rules. For example, when processing corpora from education scenarios, a string such as "好学生" ("good student") appears frequently, yet it is a phrase rather than a word and does not conform to the word-formation (lexical combination) rules of Chinese. This embodiment therefore considers the word-formation rules of entries by introducing the word string combination probability, and also considers the pronunciation rules of entries by introducing the pronunciation combination probability, so that the entries obtained by screening on the total combination probability, which is based on both, are more accurate.
Optionally, in step 1031, a first statistical language model may be used to calculate and obtain a string combination probability corresponding to each entry in the second entries.
In this embodiment, the first statistical language model may be an N-Gram language model, which is based on the assumption that the occurrence of the Nth word is related only to the previous N-1 words. In practical applications, the specific value of N may be set empirically; in this embodiment, N is set to 3, for example. The model is then trained to obtain a 3-order N-Gram language model with characters as the basic units, and this language model is used to calculate the string combination probability of each second entry.
In this embodiment, the 3-order N-Gram language model conditions each character on all the information that can be provided by the preceding 2 characters of the second entry, so the obtained string combination probability is more accurate.
Optionally, the nth root of the probability that each of the second entries conforms to the lexical combination rule may be taken to obtain the string combination probability corresponding to that entry, where n is a positive integer denoting the number of characters in the current entry.
In this embodiment, the probability that each entry conforms to the lexical combination rule may be calculated by the 3-order N-Gram language model, which may be obtained in advance by training on existing published dictionaries such as a Chinese dictionary or a Korean dictionary.
Taking a Chinese dictionary as an example, a large-scale word bank table can be generated from all the words in the dictionary, with one Chinese word per row, for example:
一下子 ("all at once")
一马平川 ("a vast expanse of flat land")
风格 ("style")
丰富多彩 ("rich and colorful")
……
The N-Gram language model is trained with the vocabulary in the word bank table as training samples, yielding a 3-order N-Gram language model with Chinese characters as the basic units. The trained 3-order N-Gram language model is then used to calculate the probability that each of the second entries conforms to the lexical combination rule. Those skilled in the art will appreciate that other data models achieving the same functionality are equally applicable, such as a 2-order N-Gram language model or other forms of data models.
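For illustration, a minimal sketch of training such a character-level 3-order model from dictionary words; it uses add-one smoothing as a simplification, whereas a real implementation would rely on an N-Gram toolkit with proper backoff:

```python
from collections import Counter

def train_char_trigram(dictionary_words):
    """Return P(char | up-to-two preceding chars) estimated from dictionary words."""
    context_counts, joint_counts = Counter(), Counter()
    charset = set()
    for word in dictionary_words:
        charset.update(word)
        for i, ch in enumerate(word):
            history = word[max(0, i - 2):i]  # up to two preceding characters
            context_counts[history] += 1
            joint_counts[(history, ch)] += 1
    vocab_size = len(charset)

    def cond_prob(ch, history):
        history = history[-2:]
        # Add-one smoothing so unseen combinations get a small probability.
        return (joint_counts[(history, ch)] + 1) / (context_counts[history] + vocab_size)

    return cond_prob
```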
In this embodiment, based on the vocabulary in the above word bank table, the formula $P'(c_1, c_2, \ldots, c_n) = P(c_1)\, P(c_2 \mid c_1) \cdots P(c_n \mid c_{n-2}, c_{n-1})$ may be used to calculate the probability that an entry S_2 = c_1, c_2, ..., c_n conforms to the lexical combination rule, where n is the number of characters in the entry and P(c_2 | c_1) denotes the probability that c_2 occurs given that c_1 has occurred, and so on. For example, the probability that the entry "风格" ("style") conforms to the lexical combination rule is P'(风格) = P(风) P(格 | 风), where P(格 | 风) denotes the probability that the character "格" appears when the character "风" has appeared.
Specifically, the formula

$$A(S_2) = \sqrt[n]{P'(c_1, c_2, \ldots, c_n)}$$

may be used to calculate the string combination probability corresponding to each of the second entries. Here A(S_2) denotes the string combination probability of the entry S_2 = c_1, c_2, ..., c_n; P'(c_1, c_2, ..., c_n) denotes the probability that S_2 conforms to the lexical combination rule; and n is a positive integer denoting the number of characters in S_2. For example, the string combination probability of the two-character entry "风格" is the square root of the probability that it conforms to the lexical combination rule, that is, $A(风格) = \sqrt{P(风)\, P(格 \mid 风)}$.
For another example, for the text "丰富多彩" ("rich and colorful"), the string combination probabilities of "丰富" and "多彩" calculated by the N-Gram language model are greater than that of the cross-boundary string "富多". In this embodiment, the N-Gram language model is trained on a Chinese dictionary; since the words in a Chinese dictionary all conform to lexical and semantic rules, calculating the string combination probability with this model prevents the finally obtained entries from violating lexical or semantic rules.
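Using a conditional-probability function such as the `cond_prob` sketched above, the string combination probability A(S_2) could be computed as follows (illustrative only):

```python
def string_combination_probability(entry, cond_prob):
    """A(S) = (P(c1) * P(c2|c1) * ... * P(cn|c_{n-2},c_{n-1})) ** (1/n)."""
    p = 1.0
    for i, ch in enumerate(entry):
        p *= cond_prob(ch, entry[max(0, i - 2):i])
    return p ** (1.0 / len(entry))
```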
In this embodiment, optionally, in the step 1032, the pronunciation combination probability corresponding to each entry in the plurality of second entries may be obtained through calculation by using a second statistical language model.
The second statistical language model in this embodiment may also be a 3-order N-Gram language model, but its internal parameters may be adjusted according to actual requirements and may differ from those of the first N-Gram language model.
In this embodiment, the pronunciation combination probability of each second entry is calculated with a 3-order N-Gram language model, which conditions each phoneme on all the information that can be provided by the preceding 2 phonemes, so the calculated pronunciation combination probability is more accurate. Similarly, other data models achieving the same functionality are equally suitable, such as a 2-order N-Gram language model or other forms of data models.
Optionally, the nth root of the probability that the phoneme string corresponding to each of the second entries conforms to the pronunciation combination rule may be taken to obtain the pronunciation combination probability corresponding to that entry, where the phoneme string is the pronunciation representation of the entry and n is the number of phonemes the phoneme string contains.
In this embodiment, the probability that the phoneme string corresponding to each entry conforms to the pronunciation combination rule may also be calculated by the second statistical language model, i.e., the 3 rd order N-gram language model, and is consistent with the concept of calculating the probability that each entry conforms to the lexical combination rule, which is not described herein again.
The 3-order N-Gram language model in this embodiment can be trained on the phoneme strings of an open-source speech recognition pronunciation dictionary. Continuing with the Chinese example, for an open-source Chinese speech recognition pronunciation dictionary such as AISHELL, the format is as follows:
数学 ("mathematics") sh u4 x ve2
物理 ("physics") uu u4 l i3
化学 ("chemistry") h ua4 x ve2
风格 ("style") f eng1 g e2
Each line consists of an entry and the phoneme string corresponding to that entry, with the phonemes in the phoneme string separated by spaces. It can be understood that the phonemes are the initials and the tone-marked finals.
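A sketch of reading a pronunciation dictionary in this format; the file path and UTF-8 encoding are assumptions:

```python
def load_pronunciation_lexicon(path):
    # Each line: "<word> <phoneme> <phoneme> ...", e.g. "风格 f eng1 g e2".
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                lexicon[parts[0]] = parts[1:]
    return lexicon
```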
In this embodiment, the phoneme strings in the pronunciation dictionary may be sorted out, an N-Gram model may be trained in advance, an N-Gram language model using phonemes as basic units may be obtained, and then the language model may be used to calculate the pronunciation combination probability corresponding to each second entry.
Specifically, the formula

$$B(S_2) = \sqrt[n]{P(x_1, x_2, \ldots, x_n)}$$

may be used to calculate the pronunciation combination probability corresponding to each of the second entries. Here B(S_2) denotes the pronunciation combination probability corresponding to the phoneme string S_2 = x_1, x_2, ..., x_n of each second entry; P(x_1, x_2, ..., x_n) denotes the probability that the phoneme string conforms to the pronunciation combination rule; x_1, x_2, ..., x_n are the phonemes corresponding to the characters c_1, c_2, ..., c_n of the entry S_2, i.e., the pronunciation representation of the entry; and n is a positive integer denoting the number of phonemes in the phoneme string. For example, for the entry "数学", the corresponding phoneme string is "sh u4 x ve2", and the pronunciation combination probability is $B(数学) = \sqrt[4]{P(sh, u4, x, ve2)}$. For the entry "风格", the corresponding phoneme string is "f eng1 g e2", and the pronunciation combination probability is $B(风格) = \sqrt[4]{P(f, eng1, g, e2)}$.
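Mirroring the string-combination sketch, the pronunciation combination probability B(S_2) could be computed as follows, with `phone_cond_prob` standing in for the phoneme-level 3-order model:

```python
def pronunciation_combination_probability(entry, lexicon, phone_cond_prob):
    """B(S) = P(x1,...,xn) ** (1/n) over the entry's phoneme string."""
    phones = lexicon[entry]  # e.g. ["f", "eng1", "g", "e2"] for 风格
    p = 1.0
    for i, ph in enumerate(phones):
        history = tuple(phones[max(0, i - 2):i])  # up to two preceding phonemes
        p *= phone_cond_prob(ph, history)
    return p ** (1.0 / len(phones))
```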
In practical application scenarios, entries are sometimes written incorrectly due to human error, for example with a wrongly written character that is a homophone of the correct one. If only the string combination probability were considered, such entries would be missed, because only the correctly written forms score well against the dictionary-trained model. After the pronunciation combination probability is introduced in this embodiment, the omission of such entries can be avoided.
After the string combination probability and the pronunciation combination probability corresponding to each second entry are obtained, step 1033 of this embodiment may use the formula $T(S_2) = \alpha A(S_2) + (1-\alpha) B(S_2)$ to calculate the total combination probability corresponding to each second entry, where T(S_2) denotes the total combination probability of the entry, α denotes a weight with 0 ≤ α ≤ 1, A(S_2) denotes the string combination probability of entry S_2, and B(S_2) denotes the pronunciation combination probability of entry S_2.
In this embodiment, the weight α may be assigned different values according to different fields and application scenarios. For example, if α is set to 0.8, the weight of the string combination probability is 0.8. Taking the entry "风格" as an example, with the string combination probability $A(风格) = \sqrt{P(风)\, P(格 \mid 风)}$ and the pronunciation combination probability $B(风格) = \sqrt[4]{P(f, eng1, g, e2)}$ obtained above, its total combination probability is $T(风格) = 0.8\, A(风格) + 0.2\, B(风格)$.
The total combination probability of this embodiment is calculated from both the string combination probability and the pronunciation combination probability, i.e., both the word-formation rules and the pronunciation rules of the entries are considered, so the finally obtained entries have no omissions and high accuracy.
In this embodiment, the weight of the string combination probability is preferably greater than the weight of the pronunciation combination probability; for example, the weight of the string combination probability may take a value in the interval 0.6 to 0.9. That is, the word-formation and semantic rules of entries are considered primarily, with the pronunciation combination probability as a supplementary screening means, so that the finally obtained entries are more accurate.
Step 104: determining, according to the total combination probability corresponding to each second entry, the second entries whose total combination probability is greater than a second preset threshold, and constructing a target vocabulary from those second entries.
In this embodiment, the second preset threshold may be set according to actual needs or manual experience; this embodiment is not particularly limited. The entries whose total combination probability is greater than the second preset threshold are retained, and those not greater are removed, i.e., entries that do not conform to word-formation or pronunciation rules are removed, so the constructed vocabulary is more accurate and more practical.
In summary, the vocabulary construction method provided by this embodiment of the application first screens the vocabulary corpus to be constructed by word frequency to obtain a plurality of first entries, then screens the first entries by word string aggregation degree to obtain a plurality of second entries, then obtains the final entries from the second entries by the total combination probability, and constructs the finally obtained entries into the target vocabulary.
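Putting the pieces together, an end-to-end sketch of the whole pipeline, reusing the helper functions sketched earlier; all threshold values here are illustrative assumptions, not values prescribed by the patent:

```python
def build_target_vocabulary(corpus_lines, dictionary_words, lexicon,
                            phone_cond_prob, alpha=0.8,
                            freq_threshold=3, agg_threshold=0.1,
                            total_threshold=0.05):
    # Step 101: candidate extraction and word-frequency screening.
    initial = extract_initial_entries(corpus_lines)
    first_entries = filter_by_frequency(initial, freq_threshold)

    # Probabilities needed for the aggregation degree (simple relative
    # frequencies; an approximation of the corpus probabilities).
    text = "".join(corpus_lines)
    char_prob = {ch: text.count(ch) / len(text) for ch in set(text)}
    total = len(initial)
    entry_prob = {e: c / total for e, c in first_entries.items()}

    # Step 102: screen by word string aggregation degree.
    second_entries = [e for e in first_entries
                      if word_string_aggregation(e, char_prob, entry_prob) > agg_threshold]

    # Steps 103-104: total combination probability and final screening.
    cond_prob = train_char_trigram(dictionary_words)
    vocab = []
    for e in second_entries:
        if e not in lexicon:
            continue  # skip entries without a pronunciation (an assumption)
        a = string_combination_probability(e, cond_prob)
        b = pronunciation_combination_probability(e, lexicon, phone_cond_prob)
        if alpha * a + (1 - alpha) * b > total_threshold:
            vocab.append(e)
    return vocab
```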
The vocabulary construction method of the present embodiment may be performed by any suitable electronic device having data processing capabilities, including but not limited to: a server, a mobile terminal (such as a mobile phone or tablet), a PC, and the like.
Example two
Fig. 4 shows the hardware structure of an electronic device according to the second embodiment of the present invention. As shown in fig. 4, the electronic device may include: a processor 301, a communication interface 302, a memory 303, and a communication bus 304.
Wherein:
the processor 301, the communication interface 302, and the memory 303 communicate with each other via a communication bus 304.
A communication interface 302 for communicating with other electronic devices or servers.
The processor 301 is configured to execute the program 305, and may specifically perform relevant steps in the foregoing vocabulary constructing method embodiment.
In particular, program 305 may include program code comprising computer operating instructions.
The processor 301 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement an embodiment of the present invention. The electronic device may comprise one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
The memory 303 stores a program 305. Memory 303 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 305 may specifically be configured to cause the processor 301 to perform the following operations: acquiring a plurality of first entries meeting a preset rule from a vocabulary corpus to be constructed; calculating the word string aggregation degree corresponding to each of the first entries, and using the first entries whose word string aggregation degree is greater than a first preset threshold as second entries, where the word string aggregation degree represents the degree to which an entry conforms to grammar rules; obtaining a total combination probability corresponding to each second entry based on the string combination probability and the pronunciation combination probability corresponding to each of the second entries; and determining, according to the total combination probability corresponding to each second entry, the second entries whose total combination probability is greater than a second preset threshold, and constructing a target vocabulary from those second entries.
In an alternative embodiment, the program 305 is further configured to cause the processor 301, when calculating and obtaining the word string cohesion corresponding to each of the first entries: calculating to obtain a first probability of each entry in the first entries appearing in the vocabulary corpus to be constructed and a second probability of each character in each entry appearing in the vocabulary corpus to be constructed; and obtaining the word string aggregation degree corresponding to each of the plurality of first terms based on the first probability and the second probability.
In an alternative embodiment, the program 305 is further configured to cause the processor 301, when obtaining the word string aggregation degree corresponding to each of the first entries based on the first probability and the second probability, to: obtain the word string aggregation degree corresponding to each of the first entries based on the ratio of the first probability of that entry to the product of its second probabilities.
In an alternative embodiment, the program 305 is further configured to cause the processor 301, when obtaining the word string aggregation degree corresponding to each of the first entries based on the ratio of the first probability of each first entry to the product of the second probabilities, to: use the formula

$$I(S_1) = \frac{p(c_1, c_2, \ldots, c_n)}{p(c_1)\, p(c_2) \cdots p(c_n)}$$

to calculate the word string aggregation degree corresponding to each of the first entries, where I(S_1) denotes the word string aggregation degree of the entry S_1 = c_1, c_2, ..., c_n; p(c_1), p(c_2), ..., p(c_n) denote the probabilities that the characters c_1, c_2, ..., c_n of S_1 respectively appear in the vocabulary corpus to be constructed; p(c_1, c_2, ..., c_n) denotes the probability that S_1 appears in the vocabulary corpus to be constructed; and n is a positive integer denoting the number of characters in S_1.
In an alternative embodiment, the program 305 is further configured to cause the processor 301 to, when obtaining the total combination probability corresponding to each second term based on the string combination probability and the pronunciation combination probability corresponding to each second term in the plurality of second terms: calculating and obtaining the string combination probability corresponding to each entry in the second entries; calculating and obtaining the pronunciation combination probability corresponding to each entry in the second entries; and respectively carrying out weighted summation on the string combination probability and the pronunciation combination probability corresponding to each second entry in the plurality of second entries to obtain the total combination probability corresponding to each second entry.
In an alternative embodiment, the program 305 is further configured to cause the processor 301, when calculating the string combination probability corresponding to each of the second entries, to: take the nth root of the probability that each of the second entries conforms to the lexical combination rule to obtain the string combination probability corresponding to that entry, where n is a positive integer denoting the number of characters in the current entry.
In an alternative embodiment, the program 305 is further configured to cause the processor 301, when taking the nth root of the probability that each of the second entries conforms to the lexical combination rule to obtain the string combination probability of each second entry, to: use the formula

$$A(S_2) = \sqrt[n]{P'(c_1, c_2, \ldots, c_n)}$$

to calculate the string combination probability corresponding to each of the second entries, where A(S_2) denotes the string combination probability of the entry S_2 = c_1, c_2, ..., c_n, P'(c_1, c_2, ..., c_n) denotes the probability that S_2 conforms to the lexical combination rule, and n is a positive integer denoting the number of characters in the entry.
In an alternative embodiment, the program 305 is further configured to cause the processor 301, when calculating the pronunciation combination probability corresponding to each of the second entries, to: take the nth root of the probability that the phoneme string corresponding to each of the second entries conforms to the pronunciation combination rule to obtain the pronunciation combination probability corresponding to that entry, where the phoneme string is the pronunciation representation of the entry and n is the number of phonemes in the phoneme string.
In an alternative embodiment, the program 305 is further configured to cause the processor 301, when taking the nth root of the probability that the phoneme string corresponding to each of the second entries conforms to the pronunciation combination rule to obtain the pronunciation combination probability of each second entry, to: use the formula

$$B(S_2) = \sqrt[n]{P(x_1, x_2, \ldots, x_n)}$$

to calculate the pronunciation combination probability corresponding to each of the second entries, where B(S_2) denotes the pronunciation combination probability corresponding to the phoneme string S_2 = x_1, x_2, ..., x_n of each second entry, P(x_1, x_2, ..., x_n) denotes the probability that the phoneme string conforms to the pronunciation combination rule, x_1, x_2, ..., x_n are the phonemes corresponding to the characters c_1, c_2, ..., c_n of the entry S_2, representing the pronunciation of each character, and n is a positive integer denoting the number of phonemes in the phoneme string.
In an alternative embodiment, the program 305 is further configured to cause the processor 301, when performing weighted summation of the string combination probability and the pronunciation combination probability corresponding to each of the second entries to obtain the total combination probability of each second entry, to: use the formula $T(S_2) = \alpha A(S_2) + (1-\alpha) B(S_2)$ to calculate the total combination probability corresponding to each second entry, where T(S_2) denotes the total combination probability corresponding to each second entry, α denotes a weight with 0 ≤ α ≤ 1, A(S_2) denotes the string combination probability of entry S_2, and B(S_2) denotes the pronunciation combination probability of entry S_2.
In an alternative embodiment, the string combination probability is weighted more heavily than the pronunciation combination probability.
In an alternative embodiment, the program 305 is further configured to cause the processor 301, when calculating the string combination probability corresponding to each entry in the second plurality of entries: and calculating and obtaining the combination probability of the word strings corresponding to each entry in the second entries through the first statistical language model.
In an alternative embodiment, the program 305 is further configured to cause the processor 301, when calculating the pronunciation combination probability corresponding to each of the plurality of second terms: and calculating and obtaining the pronunciation combination probability corresponding to each entry in the second entries through a second statistical language model.
In an alternative embodiment, the program 305 is further configured to enable the processor 301, when obtaining, from the vocabulary corpus to be constructed, a plurality of first terms satisfying a preset rule: acquiring a plurality of initial entries with different lengths from a vocabulary corpus to be constructed through a sliding window, wherein the size of the window is at least one character; counting the occurrence frequency of the plurality of initial entries with different lengths in the vocabulary corpus to be constructed; and acquiring a plurality of initial entries with the occurrence frequency larger than a third preset threshold value as the plurality of first entries.
In an alternative embodiment, the program 305 is further configured to enable the processor 301, when obtaining, from the vocabulary corpus to be constructed, a plurality of first terms satisfying a preset rule: and acquiring a plurality of initial entries with different lengths from the vocabulary corpus to be constructed through a sliding window to serve as a plurality of first entries, wherein the size of the window is at least one character.
For specific implementation of each step in the program 305, reference may be made to corresponding descriptions in corresponding steps in the foregoing embodiments of the word list construction method, which are not described herein again. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not described herein again.
With the electronic device of this embodiment: a plurality of first entries meeting a preset rule are acquired from the vocabulary corpus to be constructed; the word string aggregation degree corresponding to each first entry is calculated, and the first entries whose word string aggregation degree is greater than a first preset threshold are used as second entries, where the word string aggregation degree represents the degree to which an entry conforms to grammar rules; a total combination probability corresponding to each second entry is obtained based on the string combination probability and the pronunciation combination probability of each second entry; and the second entries whose total combination probability is greater than a second preset threshold are determined, and a target vocabulary is constructed from them. In this scheme, the first entries are screened based on the word string aggregation degree, the total combination probability of each remaining second entry is then obtained from its word string combination probability and pronunciation combination probability, and the target vocabulary is constructed based on the total combination probability. Both the word-formation combination ability of the entries and the combination ability of their pronunciations are considered, so the constructed target vocabulary is more accurate and the construction is more efficient.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code configured to perform the method illustrated in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. When executed by a central processing unit (CPU), the computer program performs the above-described functions defined in the method of the embodiment of the present invention. It should be noted that the computer readable medium in the embodiments of the present invention may be a computer readable signal medium, a computer readable storage medium, or any combination of the two. The computer readable medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the invention, a computer readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code configured to carry out operations for embodiments of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions configured to implement the specified logical function(s). In the above embodiments, specific precedence relationships are provided, but these precedence relationships are only exemplary, and in particular implementations, the steps may be fewer, more, or the execution order may be modified. That is, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor including an access module and a transmission module. The names of these modules do not, in some cases, limit the modules themselves.
As another aspect, an embodiment of the present invention further provides a computer-readable medium, on which a computer program is stored, which when executed by a processor implements the vocabulary constructing method described in the above embodiment.
As another aspect, an embodiment of the present invention further provides a computer-readable medium, which may be included in the apparatus described in the above embodiment, or may exist separately without being assembled into the apparatus. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquire a plurality of first entries meeting a preset rule from a vocabulary corpus to be constructed; calculate the word string aggregation degree corresponding to each of the first entries, and use the first entries whose word string aggregation degree is greater than a first preset threshold as second entries, where the word string aggregation degree represents the degree to which an entry conforms to grammar rules; obtain a total combination probability corresponding to each second entry based on the string combination probability and the pronunciation combination probability corresponding to each of the second entries; and determine, according to the total combination probability corresponding to each second entry, the second entries whose total combination probability is greater than a second preset threshold, and construct a target vocabulary from those second entries.
The expressions "first", "second", "said first" or "said second" used in various embodiments of the invention may modify various components without relation to order and/or importance, but these expressions do not limit the respective components. The above description is only configured for the purpose of distinguishing elements from other elements.
The foregoing description covers only the preferred embodiments of the invention and illustrates the principles of the technology employed. Those skilled in the art will appreciate that the scope of the invention is not limited to technical solutions formed by the specific combination of the above features, but also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the embodiments of the present invention.

Claims (17)

1. A vocabulary construction method, characterized in that the method comprises:
acquiring a plurality of first entries meeting preset rules from a vocabulary corpus to be constructed;
calculating the word string aggregation degree corresponding to each of the plurality of first entries, and taking, as second entries, those first entries whose word string aggregation degree is greater than a first preset threshold, wherein the word string aggregation degree represents the degree to which an entry conforms to grammar rules;
obtaining a total combination probability corresponding to each second entry based on the string combination probability and the pronunciation combination probability corresponding to each second entry in the plurality of second entries;
and, according to the total combination probability corresponding to each second entry, determining, from the plurality of second entries, the second entries whose total combination probability is greater than a second preset threshold, and constructing a target vocabulary from those second entries.
2. The method according to claim 1, wherein the calculating the word string aggregation degree corresponding to each of the first entries comprises:
calculating a first probability that each entry in the plurality of first entries appears in the vocabulary corpus to be constructed, and a second probability that each character in each entry appears in the vocabulary corpus to be constructed; and obtaining the word string aggregation degree corresponding to each of the plurality of first entries based on the first probability and the second probabilities.
3. The method according to claim 2, wherein the obtaining the word string aggregation degree corresponding to each of the first entries based on the first probability and the second probabilities comprises:
obtaining the word string aggregation degree corresponding to each of the first entries based on the ratio of the first probability of each first entry to the product of the second probabilities of its characters.
4. The method according to claim 3, wherein the obtaining the word string aggregation degree corresponding to each of the first entries based on the ratio of the first probability of each first entry to the product of the second probabilities comprises:
calculating the word string aggregation degree corresponding to each entry in the plurality of first entries using the formula
I(S1) = p(c1, c2...cn) / (p(c1) * p(c2) * ... * p(cn))
wherein I(S1) represents the word string aggregation degree corresponding to the entry S1 = c1, c2...cn; p(c1), p(c2)...p(cn) represent the probabilities that the characters c1, c2...cn of the entry S1 respectively appear in the vocabulary corpus to be constructed; p(c1, c2...cn) represents the probability that the entry S1 = c1, c2...cn appears in the vocabulary corpus to be constructed; and n is a positive integer representing the number of characters in the entry S1.
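As an illustrative aside (not part of the claims): if the probabilities are estimated as simple relative frequencies over the corpus, the formula can be computed as below; this toy estimator is an assumption, not the patent's prescribed estimation method.

from collections import Counter

def string_aggregation_degree(entry, corpus):
    # p(c1,c2...cn): relative frequency of the whole entry among all
    # character windows of the same length in the corpus.
    n = len(entry)
    num_windows = len(corpus) - n + 1
    windows = Counter(corpus[i:i + n] for i in range(num_windows))
    p_entry = windows[entry] / num_windows
    # p(ci): relative frequency of each individual character.
    chars = Counter(corpus)
    product = 1.0
    for c in entry:
        product *= chars[c] / len(corpus)
    # I(S1) = p(c1,c2...cn) / (p(c1) * p(c2) * ... * p(cn))
    return p_entry / product

A high value means the characters co-occur far more often than chance predicts, i.e. the candidate behaves like a genuine word rather than an accidental character sequence.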
5. The method according to claim 1, wherein obtaining the total combination probability for each of the second entries based on the string combination probability and the pronunciation combination probability for each of the second entries comprises:
calculating and obtaining the string combination probability corresponding to each entry in the second entries; calculating and obtaining the pronunciation combination probability corresponding to each entry in the second entries;
and respectively carrying out weighted summation on the string combination probability and the pronunciation combination probability corresponding to each second entry in the plurality of second entries to obtain the total combination probability corresponding to each second entry.
6. The method of claim 5, wherein the calculating the string combination probability corresponding to each entry in the plurality of second entries comprises:
taking the n-th root (open n-th power) of the probability that each entry in the plurality of second entries conforms to the lexical combination rule, to obtain the string combination probability corresponding to each entry in the plurality of second entries, wherein n is a positive integer representing the number of characters in the current entry.
7. The method according to claim 6, wherein the taking the n-th root of the probability that each of the second entries conforms to the lexical combination rule to obtain the string combination probability corresponding to each of the second entries comprises:
calculating the string combination probability corresponding to each entry in the plurality of second entries using the formula
A(S2) = P'(c1, c2...cn) ^ (1/n)
wherein A(S2) represents the string combination probability corresponding to the entry S2 = c1, c2...cn; P'(c1, c2...cn) represents the probability that the entry S2 = c1, c2...cn conforms to the lexical combination rule; and n is a positive integer representing the number of characters in the entry.
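As an illustrative aside (not part of the claims): in code the open n-th power is a 1/n exponent, which turns the joint probability into a per-character geometric mean so that entries of different lengths become comparable. The input value 0.008 is made up for the demonstration.

def string_combination_probability(p_lexical, n):
    # A(S2) = P'(c1,c2...cn) ** (1/n)
    return p_lexical ** (1.0 / n)

print(string_combination_probability(0.008, 3))   # 0.008 ** (1/3) = 0.2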
8. The method of claim 5, wherein the calculating the pronunciation combination probability corresponding to each entry in the second entries comprises:
taking the n-th root (open n-th power) of the probability that the phoneme string corresponding to each entry in the plurality of second entries conforms to the pronunciation combination rule, to obtain the pronunciation combination probability corresponding to each entry in the plurality of second entries, wherein the phoneme string is the pronunciation representation of the entry, and n is the number of phonemes in the phoneme string.
9. The method of claim 8, wherein the taking the n-th root of the probability that the phoneme string corresponding to each of the second entries conforms to the pronunciation combination rule to obtain the pronunciation combination probability corresponding to each of the second entries comprises:
calculating the pronunciation combination probability corresponding to each entry in the plurality of second entries using the formula
B(S2) = P(x1, x2...xn) ^ (1/n)
wherein B(S2) represents the pronunciation combination probability corresponding to the phoneme string S2 = x1, x2...xn of each second entry; P(x1, x2...xn) represents the probability that the phoneme string x1, x2...xn conforms to the pronunciation combination rule; x1, x2...xn are respectively the phonemes corresponding to the characters c1, c2...cn of the entry S2, i.e. the pronunciation representation of each character; and n is a positive integer representing the number of phonemes contained in the phoneme string.
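As an illustrative aside (not part of the claims): the pronunciation score is the same length-normalized root, applied to the entry's phoneme string instead of its characters. In the sketch below the character-to-phoneme mapping phonemes and the model p_of_string are assumed inputs.

def pronunciation_combination_probability(entry, phonemes, p_of_string):
    # Expand the entry c1,c2...cn into its phoneme string x1,x2...xn.
    xs = [x for c in entry for x in phonemes[c]]
    n = len(xs)   # n = number of phonemes in the phoneme string
    # B(S2) = P(x1,x2...xn) ** (1/n)
    return p_of_string(xs) ** (1.0 / n)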
10. The method according to claim 5, wherein the performing weighted summation on the string combination probability and the pronunciation combination probability corresponding to each of the second entries to obtain the total combination probability corresponding to each of the second entries comprises:
calculating the total combination probability corresponding to each second entry using the formula T(S2) = αA(S2) + (1 - α)B(S2); wherein T(S2) represents the total combination probability corresponding to each second entry, α represents a weight with 0 ≤ α ≤ 1, A(S2) represents the string combination probability of the entry S2, and B(S2) represents the pronunciation combination probability of the entry S2.
11. The method of claim 10, wherein the string combination probability is weighted more heavily than the pronunciation combination probability.
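As an illustrative aside (not part of the claims): claims 10 and 11 together fix the final score as a convex combination in which the string term dominates. The value alpha = 0.6 below is an assumed choice satisfying 0.5 < alpha <= 1.

def total_combination_probability(a_score, b_score, alpha=0.6):
    # T(S2) = alpha * A(S2) + (1 - alpha) * B(S2), with the string
    # combination probability weighted more heavily than the
    # pronunciation combination probability.
    assert 0.5 < alpha <= 1.0
    return alpha * a_score + (1.0 - alpha) * b_score

print(total_combination_probability(0.25, 0.15))   # 0.6*0.25 + 0.4*0.15 = 0.21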
12. The method of claim 5, wherein said calculating the string combination probability corresponding to each entry in the plurality of second entries comprises:
calculating, by means of a first statistical language model, the string combination probability corresponding to each entry in the plurality of second entries.
13. The method of claim 5, wherein said calculating the pronunciation combination probability corresponding to each entry in the plurality of second entries comprises:
calculating, by means of a second statistical language model, the pronunciation combination probability corresponding to each entry in the plurality of second entries.
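As an illustrative aside (not part of the claims): the claims leave the model family open. One common concrete choice would be an n-gram statistical language model, over characters for the first model and over phonemes for the second; the unsmoothed bigram sketch below is such an assumption, not the model prescribed by the patent.

from collections import Counter

class BigramLM:
    # P(s) = P(s[0]) * product over i of P(s[i] | s[i-1]),
    # estimated by maximum likelihood; real systems would add smoothing.
    def __init__(self, corpus):
        self.unigrams = Counter(corpus)
        self.bigrams = Counter(zip(corpus, corpus[1:]))
        self.total = len(corpus)

    def prob(self, s):
        if self.total == 0 or self.unigrams[s[0]] == 0:
            return 0.0
        p = self.unigrams[s[0]] / self.total
        for prev, cur in zip(s, s[1:]):
            if self.unigrams[prev] == 0:
                return 0.0
            p *= self.bigrams[(prev, cur)] / self.unigrams[prev]
        return p

The same class instantiated on phoneme sequences (lists of phonemes rather than a character string) would serve as the second statistical language model.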
14. The method according to claim 1, wherein the acquiring a plurality of first entries meeting a preset rule from the vocabulary corpus to be constructed comprises:
acquiring a plurality of initial entries with different lengths from a vocabulary corpus to be constructed through a sliding window, wherein the size of the window is at least one character;
counting the occurrence frequency of the plurality of initial entries with different lengths in the vocabulary corpus to be constructed;
and acquiring, as the plurality of first entries, the initial entries whose occurrence frequency is greater than a third preset threshold.
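As an illustrative aside (not part of the claims): a minimal sketch of this extraction step, where the maximum window size of 4 characters and the count threshold are assumed placeholders for the preset values.

from collections import Counter

def extract_first_entries(corpus, max_len=4, third_threshold=5):
    # Slide windows of 1..max_len characters over the corpus and count
    # how often each initial entry occurs.
    counts = Counter(
        corpus[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(corpus) - n + 1)
    )
    # Keep the initial entries whose occurrence frequency is greater
    # than the third preset threshold.
    return [entry for entry, count in counts.items() if count > third_threshold]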
15. The method according to claim 1, wherein the acquiring a plurality of first entries meeting a preset rule from the vocabulary corpus to be constructed comprises:
acquiring, as the plurality of first entries, a plurality of initial entries with different lengths from the vocabulary corpus to be constructed through a sliding window, wherein the size of the window is at least one character.
16. An electronic device, characterized in that the device comprises:
one or more processors;
a computer-readable medium configured to store one or more programs,
which, when executed by the one or more processors, cause the one or more processors to implement the vocabulary construction method of any one of claims 1 to 15.
17. A computer-readable medium, on which a computer program is stored which, when executed by a processor, carries out the vocabulary construction method according to any one of claims 1 to 15.
CN202010732672.7A 2020-07-27 2020-07-27 Word list construction method, electronic device and computer readable medium Active CN111831832B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010732672.7A CN111831832B (en) 2020-07-27 2020-07-27 Word list construction method, electronic device and computer readable medium

Publications (2)

Publication Number Publication Date
CN111831832A CN111831832A (en) 2020-10-27
CN111831832B (en) 2022-07-01

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114822483A (en) * 2021-01-19 2022-07-29 美的集团(上海)有限公司 Data enhancement method, device, equipment and storage medium
CN113742459B (en) * 2021-11-05 2022-03-04 北京世纪好未来教育科技有限公司 Vocabulary display method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105260362A (en) * 2015-10-30 2016-01-20 小米科技有限责任公司 New word extraction method and device
CN107665705A (en) * 2017-09-20 2018-02-06 平安科技(深圳)有限公司 Voice keyword recognition method, device, equipment and computer-readable recording medium
CN107818148A (en) * 2017-10-23 2018-03-20 南京南瑞集团公司 Self-service query and statistical analysis method based on natural language processing
CN110176237A (en) * 2019-07-09 2019-08-27 北京金山数字娱乐科技有限公司 A kind of audio recognition method and device
CN110209765A (en) * 2019-05-23 2019-09-06 武汉绿色网络信息服务有限责任公司 A kind of method and apparatus by semantic search key

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5810814B2 (en) * 2011-10-11 2015-11-11 カシオ計算機株式会社 Electronic device having dictionary function, compound word search method, and program
CN105912521A (en) * 2015-12-25 2016-08-31 乐视致新电子科技(天津)有限公司 Method and device for parsing voice content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant