CN112002310A - Domain language model construction method and device, computer equipment and storage medium - Google Patents

Domain language model construction method and device, computer equipment and storage medium Download PDF

Info

Publication number
CN112002310A
Authority
CN
China
Prior art keywords
wfsa
network
domain
language model
optimal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010669031.1A
Other languages
Chinese (zh)
Other versions
CN112002310B (en)
Inventor
张旭华
齐欣
孙泽明
朱林林
王宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suning Cloud Computing Co Ltd
Original Assignee
Suning Cloud Computing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suning Cloud Computing Co Ltd filed Critical Suning Cloud Computing Co Ltd
Priority to CN202010669031.1A
Publication of CN112002310A
Priority to PCT/CN2021/099661 (WO2022012238A1)
Application granted
Publication of CN112002310B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering

Abstract

The invention discloses a domain language model construction method and apparatus, a computer device, and a storage medium, belonging to the technical field of speech recognition. The method comprises the following steps: converting a generic language model into an equivalent first WFSA network; screening optimal paths that meet a preset condition from the first WFSA network according to a preset number of domain corpora, so as to construct a second WFSA network; and normalizing the second WFSA network and converting the normalized second WFSA network into a domain language model. When domain training corpora are insufficient, the method can quickly construct a domain language model that fits a specific scenario while retaining general generalization capability.

Description

Domain language model construction method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a method and an apparatus for constructing a domain language model, a computer device, and a storage medium.
Background
Most speech recognition schemes are based on language models. The most commonly used model for training is the N-Gram model, a statistical language model whose quality generally improves as the training corpus grows. As application scenarios become more specialized, language models that both meet the requirements of a specific scenario and retain generalization capability are increasingly needed, which places higher demands on corpus selection.
At present, there are two common methods for constructing a language model for a specific scenario: one is to collect related domain corpora and train on them directly; the other is to fuse a trained domain language model with a general language model according to certain weights to increase generalization capability. Both methods require a large amount of domain corpora, yet finding domain corpora that fit the scenario is not easy.
Disclosure of Invention
In order to solve the problems in the prior art, embodiments of the present invention provide a method and an apparatus for constructing a domain language model, a computer device, and a storage medium, which can quickly construct a domain language model that fits a specific scenario and retains general generalization capability even when domain training corpora are insufficient.
In a first aspect, a method for constructing a domain language model is provided, where the method includes:
converting the generic language model to an equivalent first WFSA network;
screening optimal paths that meet a preset condition from the first WFSA network according to a preset number of domain corpora, so as to construct a second WFSA network;
and normalizing the second WFSA network, and converting the normalized second WFSA network into a domain language model.
Further, the screening out of optimal paths meeting a preset condition from the first WFSA network according to a preset number of domain corpora to construct a second WFSA network includes:
for each domain corpus, searching a preset number of candidate optimal paths in the first WFSA network;
screening out the optimal paths corresponding to the domain corpora from the preset number of candidate optimal paths, wherein the probability on the outgoing arc of each state node of the optimal paths exceeds a preset threshold;
and constructing the second WFSA network according to the optimal path corresponding to each domain corpus.
Further, the searching a preset number of candidate optimal paths in the first WFSA network for each of the domain corpora includes:
for each domain corpus, inputting the domain corpus into the first WFSA network for searching, so as to obtain a plurality of candidate paths corresponding to the domain corpus and the path probability of each candidate path;
and sorting the plurality of candidate paths corresponding to the domain corpus in descending order of path probability, and taking the top preset number of candidate paths as the candidate optimal paths of the domain corpus.
Further, the normalizing the second WFSA network includes:
normalizing the probabilities on all the outgoing arcs of each state node in the second WFSA network according to the number of outgoing arcs of each state node in the second WFSA network and the probabilities on the respective outgoing arcs.
Further, the general language model and the domain language model are both N-Gram language models.
In a second aspect, an apparatus for constructing a domain language model is provided, the apparatus comprising:
the first conversion module is used for converting the generic language model into an equivalent first WFSA network;
the construction module is used for screening out optimal paths that meet a preset condition from the first WFSA network according to a preset number of domain corpora, so as to construct a second WFSA network;
a normalization module, configured to normalize the second WFSA network;
and the second conversion module is used for converting the normalized second WFSA network into a domain language model.
Further, the construction module includes:
the search submodule is used for searching, for each domain corpus, a preset number of candidate optimal paths in the first WFSA network;
the screening submodule is used for screening out the optimal paths corresponding to the domain corpora from the preset number of candidate optimal paths, wherein the probability on the outgoing arc of each state node of the optimal paths exceeds a preset threshold;
and the construction submodule is used for constructing the second WFSA network according to the optimal path corresponding to each domain corpus.
Further, the search sub-module is specifically configured to:
for each domain corpus, inputting the domain corpus into the first WFSA network for searching, so as to obtain a plurality of candidate paths corresponding to the domain corpus and the path probability of each candidate path;
and sorting the plurality of candidate paths corresponding to the domain corpus in descending order of path probability, and taking the top preset number of candidate paths as the candidate optimal paths of the domain corpus.
Further, the normalization module is specifically configured to:
normalizing the probabilities on all the outgoing arcs of each state node in the second WFSA network according to the number of outgoing arcs of each state node in the second WFSA network and the probabilities on the respective outgoing arcs.
Further, the general language model and the domain language model are both N-Gram language models.
In a third aspect, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the following steps are implemented:
converting the generic language model to an equivalent first WFSA network;
screening optimal paths that meet a preset condition from the first WFSA network according to a preset number of domain corpora, so as to construct a second WFSA network;
and normalizing the second WFSA network, and converting the normalized second WFSA network into a domain language model.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, performs the steps of:
converting the generic language model to an equivalent first WFSA network;
screening optimal paths that meet a preset condition from the first WFSA network according to a preset number of domain corpora, so as to construct a second WFSA network;
and normalizing the second WFSA network, and converting the normalized second WFSA network into a domain language model.
The invention provides a method and an apparatus for constructing a domain language model, a computer device, and a storage medium. A general language model is first converted into an equivalent first WFSA network; optimal paths that meet a preset condition are then screened out of the first WFSA network according to a preset number of domain corpora to construct a second WFSA network; finally, the second WFSA network is normalized and converted into a domain language model. Because the paths used to construct the second WFSA network are screened from the first WFSA network converted from the general language model, and are screened according to the preset number of domain corpora, the domain language model obtained by converting the normalized second WFSA network both meets the requirements of the specific scenario and retains general generalization capability. This achieves the goal of quickly constructing such a domain language model even when domain corpora are insufficient.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a flowchart illustrating a domain language model building method according to an embodiment of the present invention;
FIG. 2 is a detailed flowchart of step S2 shown in FIG. 1;
FIG. 3 is a block diagram illustrating a domain language model building apparatus according to an embodiment of the present invention;
FIG. 4 shows an internal structure diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It is to be understood that, unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to". Furthermore, in the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
As described in the background, there are generally two methods for constructing a language model for a specific scenario: one is to collect related domain corpora and train on them directly, and the other is to fuse a trained domain language model with a general language model according to certain weights to increase generalization capability. Both methods require a large amount of training corpora, yet finding corpora that fit the scenario is not easy. The domain language model in the embodiments of the present invention may be applied to a scenario in a specific domain, where the specific domain may be the financial domain, the medical domain, the commodity domain, the logistics domain, or another specific domain; the present invention is not limited in this respect.
Fig. 1 shows a flowchart of a domain language model building method according to an embodiment of the present invention, which is illustrated by taking a domain language model building apparatus as an execution subject, where the apparatus may be configured in any computer device, and the computer device may be an independent server or a server cluster.
Referring to fig. 1, the method for constructing a domain language model provided by the present invention includes steps S1 to S4:
s1: the generic language model is converted into an equivalent first WFSA network.
The generic language model may be a statistical language model, i.e., a probability distribution over a sequence of words: for a sequence of length m, it yields a probability P(w1, w2, ..., wm) for the entire sequence. The essence is to find a probability distribution over sentences or sequences that can represent the probability of any sentence or sequence occurring, where the probability of the current word is usually characterized by a conditional probability on the n words that occurred before it. N-Gram is an algorithm based on a statistical language model and on the Markov assumption, namely: the N-th word in a piece of text is assumed to depend only on the preceding N-1 words and on no other words. Under this assumption, the probability of each word in the text can be evaluated, and the probability of the whole sentence is the product of the probabilities of the individual words. These probabilities can be obtained by directly counting how often N words occur together in the corpus; commonly used N-Gram models include the binary Bi-Gram and the ternary Tri-Gram.
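To make the Markov assumption concrete, the following is a minimal sketch, not taken from the patent: the Bi-Gram order, the toy corpus, and the absence of smoothing are illustrative assumptions. It estimates bigram probabilities from counts and scores a sentence as the product of per-word conditional probabilities.

```python
from collections import defaultdict

def train_bigram(sentences):
    # Count unigrams and bigrams over the training corpus.
    unigram, bigram = defaultdict(int), defaultdict(int)
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        for i, w in enumerate(tokens):
            unigram[w] += 1
            if i > 0:
                bigram[(tokens[i - 1], w)] += 1
    # P(w | prev) = count(prev, w) / count(prev); no smoothing in this sketch.
    return lambda prev, w: bigram[(prev, w)] / unigram[prev] if unigram[prev] else 0.0

def sentence_probability(prob, words):
    # The probability of the whole sentence is the product of the per-word
    # conditional probabilities under the Markov assumption.
    p = 1.0
    tokens = ["<s>"] + words + ["</s>"]
    for prev, w in zip(tokens, tokens[1:]):
        p *= prob(prev, w)
    return p

corpus = [["weather", "today", "is", "really", "good"],
          ["weather", "today", "is", "bad"]]
prob = train_bigram(corpus)
print(sentence_probability(prob, ["weather", "today", "is", "really", "good"]))
```

A production N-Gram model would additionally apply smoothing and back-off, which is the information the arpa format stores.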
The universal language model may be generated in advance by training on a general corpus. The general corpus may be obtained by crawling Chinese text from the internet with a web crawler tool or by directly downloading a publicly available free Chinese corpus, and the universal language model may be stored in the arpa format. It should be noted that a universal language model is used rather than another domain model because the universal language model does not emphasize any particular domain: it is a relatively smooth set of probabilities computed over a large amount of historical text, so it is easier to migrate to a target domain and reflects word-to-word transition probabilities that are close to reality.
The first WFSA network is a directed graph structure with a plurality of state nodes connected by arcs that represent transitions between states. The arcs are directional, and each arc carries an input label and the probability of the corresponding state transition: the input label is a word, and the probability characterizes how likely the arc is to appear in a path. The first WFSA network may contain a plurality of paths, and the probability of each path can be calculated as the product of the probabilities on all arcs in the path; when the probability is expressed as a weight on an arc between state nodes, the weight value may be obtained by taking the logarithm of the probability.
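As an illustration of this structure, the following is a minimal sketch in which the class and field names are assumptions rather than definitions from the patent. It stores a WFSA as state nodes with outgoing arcs, where each arc carries a word label and a probability, and a path probability is the product of its arc probabilities.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Arc:
    label: str        # input label on the arc: a word
    prob: float       # probability of this state transition
    next_state: int   # destination state node

@dataclass
class WFSA:
    # Outgoing arcs of each state node; state 0 is taken as the start state here.
    arcs: Dict[int, List[Arc]] = field(default_factory=dict)

    def add_arc(self, state: int, label: str, prob: float, next_state: int) -> None:
        self.arcs.setdefault(state, []).append(Arc(label, prob, next_state))

    @staticmethod
    def path_probability(path: List[Arc]) -> float:
        # Product of the probabilities on all arcs in the path; expressed as
        # weights this is the sum of log-probabilities.
        return math.exp(sum(math.log(a.prob) for a in path))
```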
Specifically, when converting the generic language model in the arpa format into the first WFSA (Weighted Finite-State Automata) network, the execution body may call the arpa2fst tool to obtain an equivalent first WFSA network. Of course, in practical applications, besides calling the arpa2fst tool, an equivalent first WFSA network may also be obtained in other ways; this embodiment does not specifically limit the conversion method.
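As a concrete illustration of this step, the conversion could be driven from a script as sketched below. The flags follow common usage of the Kaldi arpa2fst tool and the file names are placeholders, so treat the invocation as an assumption rather than the patent's prescribed command.

```python
import subprocess

def arpa_to_wfsa(arpa_path: str, words_path: str, fst_path: str) -> None:
    # Convert an arpa-format language model into an equivalent WFSA/FST by
    # calling the arpa2fst tool (assumed to be installed and on PATH).
    subprocess.run(
        ["arpa2fst",
         "--disambig-symbol=#0",
         f"--read-symbol-table={words_path}",
         arpa_path,
         fst_path],
        check=True,
    )

# Example with hypothetical file names:
# arpa_to_wfsa("general_lm.arpa", "words.txt", "G_general.fst")
```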
S2: and screening the optimal path meeting the preset conditions from the first WFSA network according to the preset number of field linguistic data to construct a second WFSA network.
The domain corpora may be common words and sentences, professional terms and sentences, and the like in a specific domain.
The preset number may be set in advance to a value below a preset limit; it is understood that the number of domain corpus samples is much smaller than the number of general corpus samples.
In this embodiment, for each domain corpus, a plurality of paths may be searched in the first WFSA network, and the one or more optimal paths that satisfy the preset condition for that corpus may be obtained through screening. For example, the optimal path may be the path with the highest path probability, and the word sequence corresponding to each domain corpus may be obtained from its optimal path.
Here, the preset condition is a condition set in advance for determining the optimal path. In a specific application, the preset condition may be set as: when the probability on the outgoing arc of every state node along a path exceeds a preset threshold, the path is an optimal path. Alternatively, the preset condition may be set as: when the sum of the probabilities on all the arcs a path passes through exceeds a preset threshold, the path is an optimal path.
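Both variants of the preset condition can be written as simple predicates over a candidate path. The sketch below reuses the Arc structure assumed earlier; the threshold values and path representation are illustrative assumptions.

```python
from typing import List

def every_arc_exceeds(path: List["Arc"], threshold: float) -> bool:
    # Variant 1: the probability on the outgoing arc of every state node
    # along the path exceeds the preset threshold.
    return all(arc.prob > threshold for arc in path)

def path_prob_sum_exceeds(path: List["Arc"], threshold: float) -> bool:
    # Variant 2: the sum of the probabilities on all arcs the path passes
    # through exceeds the preset threshold.
    return sum(arc.prob for arc in path) > threshold
```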
Specifically, as shown in fig. 2, the implementation process of step S2 may include the steps of:
s21: and searching a preset number of candidate optimal paths corresponding to the domain linguistic data in the first WFSA network aiming at each domain linguistic data.
The preset number may be set as an integer value according to actual needs, and the specific preset number is not limited in this embodiment.
Specifically, the process may include:
for each domain corpus, inputting the domain corpus into the first WFSA network for searching, so as to obtain a plurality of candidate paths corresponding to the domain corpus and the path probability of each candidate path; then sorting the plurality of candidate paths corresponding to the domain corpus in descending order of path probability, and taking the top preset number of candidate paths as the candidate optimal paths of the domain corpus.
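The sorting and truncation step can be sketched as follows. The `search_paths` helper that enumerates candidate paths and their probabilities in the first WFSA is hypothetical and only stands in for the search described above.

```python
from typing import Callable, List, Tuple

def candidate_optimal_paths(
    search_paths: Callable[[List[str]], List[Tuple[list, float]]],
    corpus_tokens: List[str],
    top_n: int,
) -> List[Tuple[list, float]]:
    # Enumerate candidate paths for one domain corpus together with their
    # path probabilities, then keep the top_n by descending probability.
    candidates = search_paths(corpus_tokens)
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:top_n]
```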
For example, assuming that the domain corpus "weather today is really good" is input into the first WFSA network, the following two candidate optimal paths may be searched out:
PATH1: <s> weather today is really good </s>
PATH2: <s> weather today is really good
S22: and screening out the optimal paths corresponding to the field linguistic data from the preset number of candidate optimal paths, wherein the probability of the transmitting arc of each state node of the optimal paths exceeds a preset threshold value.
In this embodiment, after one or more candidate optimal paths corresponding to a given domain corpus are searched, when the probability of the emission arc of each state node on one candidate optimal path exceeds a preset threshold, the candidate optimal path is the optimal path corresponding to the domain corpus.
S23: and constructing a second WFSA network according to the optimal path corresponding to each field corpus.
Specifically, a second WFSA network that only includes the initial state node and the end state node may be pre-constructed, and after each optimal path corresponding to one field corpus is obtained, the optimal path is updated to the second WFSA network until the optimal path corresponding to the last field corpus is updated to the second WFSA network, that is, the second WFSA network is constructed.
S3: the second WFSA network is normalized.
Specifically, the probabilities on all the outgoing arcs of each state node in the second WFSA network are normalized according to the number of outgoing arcs of each state node and the probabilities on the respective arcs, so that the probabilities on all the outgoing arcs of each state node in the second WFSA network sum to 1.
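A minimal sketch of this normalization over the WFSA structure assumed earlier:

```python
def normalize(second: "WFSA") -> None:
    # For each state node, divide the probability on every outgoing arc by the
    # sum over that node's outgoing arcs, so that the outgoing probabilities
    # of every state node sum to 1.
    for arcs in second.arcs.values():
        total = sum(a.prob for a in arcs)
        if total > 0:
            for a in arcs:
                a.prob /= total
```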
S4: and converting the normalized second WFSA network into a domain language model.
And when the general language model is the N-Gram model, the domain language model is the N-Gram model with the same order as the general language model.
Specifically, the execution body may convert the second WFSA network into an N-Gram model in the arpa format by calling the fsts-to-bridges tool, thereby obtaining the domain language model. Besides calling the fsts-to-bridges tool, the domain language model may also be obtained through conversion in other ways, which is not limited in this embodiment.
The invention provides a domain language model construction method: a general language model is first converted into an equivalent first WFSA network; optimal paths that meet a preset condition are then screened out of the first WFSA network according to a preset number of domain corpora to construct a second WFSA network; finally, the second WFSA network is normalized and converted into a domain language model. Because the paths used to construct the second WFSA network are screened from the first WFSA network converted from the general language model, and are screened according to the preset number of domain corpora, the domain language model converted from the normalized second WFSA network both meets the requirements of the specific scenario and retains general generalization capability, achieving the goal of quickly constructing such a language model even when training corpora are insufficient.
Fig. 3 is a block diagram illustrating a domain language model building apparatus according to an embodiment of the present invention, and referring to fig. 3, the apparatus includes:
a first conversion module 31 for converting the generic language model into an equivalent first WFSA network;
a constructing module 32, configured to screen an optimal path meeting a preset condition from the first WFSA network according to a preset number of domain corpora, so as to construct a second WFSA network;
a normalization module 33, configured to normalize the second WFSA network;
and a second conversion module 34, configured to convert the normalized second WFSA network into a domain language model.
In one embodiment, the construction module 32 includes:
the searching submodule 321 is configured to search, for each domain corpus, a preset number of candidate optimal paths corresponding to the domain corpus in the first WFSA network;
the screening submodule 322 is configured to screen out the optimal path corresponding to the domain corpus from the preset number of candidate optimal paths, where the probability on the outgoing arc of each state node of the optimal path exceeds a preset threshold;
and the constructing submodule 323 is used for constructing the second WFSA network according to the optimal path corresponding to each domain corpus.
In one embodiment, the search submodule 321 is specifically configured to:
inputting, for each domain corpus, the domain corpus into the first WFSA network for searching, so as to obtain a plurality of candidate paths corresponding to the domain corpus and the path probability of each candidate path;
and sorting the plurality of candidate paths corresponding to the domain corpus in descending order of path probability, and taking the top preset number of candidate paths as the candidate optimal paths of the domain corpus.
In one embodiment, the normalization module 33 is specifically configured to:
normalizing the probabilities on all the outgoing arcs of each state node in the second WFSA network according to the number of outgoing arcs of each state node and the probabilities on the respective arcs, so that the probabilities on all the outgoing arcs of each state node in the second WFSA network sum to 1.
In one embodiment, the generic language model and the domain language model are both N-Gram language models.
The device for constructing the domain language model provided by the embodiment of the invention belongs to the same inventive concept as the method for constructing the domain language model provided by the embodiment of the invention, can execute the method for constructing the domain language model provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects for executing the method for constructing the domain language model. For technical details that are not described in detail in this embodiment, reference may be made to the domain language model construction method provided in this embodiment of the present invention, and details are not described here again.
Fig. 4 shows an internal structure diagram of a computer device according to an embodiment of the present invention. The computer device may be a server, and its internal structure diagram may be as shown in fig. 4. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a domain language model building method.
In one embodiment, there is provided a computer device comprising:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the following steps:
converting the generic language model to an equivalent first WFSA network;
screening optimal paths that meet a preset condition from the first WFSA network according to a preset number of domain corpora, so as to construct a second WFSA network;
and normalizing the second WFSA network, and converting the normalized second WFSA network into a domain language model.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, performs the steps of:
converting the generic language model to an equivalent first WFSA network;
screening optimal paths that meet a preset condition from the first WFSA network according to a preset number of domain corpora, so as to construct a second WFSA network;
and normalizing the second WFSA network, and converting the normalized second WFSA network into a domain language model.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above examples only express several embodiments of the present application, and although their description is specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and these fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for constructing a domain language model, the method comprising:
converting the generic language model to an equivalent first WFSA network;
screening optimal paths that meet a preset condition from the first WFSA network according to a preset number of domain corpora, so as to construct a second WFSA network;
and normalizing the second WFSA network, and converting the normalized second WFSA network into a domain language model.
2. The method of claim 1, wherein the screening out of optimal paths meeting a preset condition from the first WFSA network according to a preset number of domain corpora to construct a second WFSA network comprises:
for each domain corpus, searching a preset number of candidate optimal paths in the first WFSA network;
screening out the optimal paths corresponding to the domain corpora from the preset number of candidate optimal paths, wherein the probability on the outgoing arc of each state node of the optimal paths exceeds a preset threshold;
and constructing the second WFSA network according to the optimal path corresponding to each domain corpus.
3. The method according to claim 2, wherein the searching out a preset number of candidate optimal paths in the first WFSA network for each of the domain corpora comprises:
for each domain corpus, inputting the domain corpus into the first WFSA network for searching, so as to obtain a plurality of candidate paths corresponding to the domain corpus and the path probability of each candidate path;
and sorting the plurality of candidate paths corresponding to the domain corpus in descending order of path probability, and taking the top preset number of candidate paths as the candidate optimal paths of the domain corpus.
4. The method of any of claims 1 to 3, wherein the normalizing the second WFSA network comprises:
normalizing the probabilities on all the outgoing arcs of each state node in the second WFSA network according to the number of outgoing arcs of each state node in the second WFSA network and the probabilities on the respective outgoing arcs.
5. The method of claim 1, wherein the generic language model and the domain language model are both N-Gram language models.
6. A domain language model building apparatus, the apparatus comprising:
the first conversion module is used for converting the universal language model into an equivalent first WFSA network;
the construction module is used for screening out optimal paths that meet a preset condition from the first WFSA network according to a preset number of domain corpora, so as to construct a second WFSA network;
a normalization module, configured to normalize the second WFSA network;
and the second conversion module is used for converting the normalized second WFSA network into a domain language model.
7. The apparatus of claim 6, wherein the configuration module comprises:
the search submodule is used for searching, for each domain corpus, a preset number of candidate optimal paths in the first WFSA network;
the screening submodule is used for screening out the optimal paths corresponding to the domain corpora from the preset number of candidate optimal paths, wherein the probability on the outgoing arc of each state node of the optimal paths exceeds a preset threshold value;
and the construction submodule is used for constructing the second WFSA network according to the optimal path corresponding to each domain corpus.
8. The apparatus of claim 7, wherein the search submodule is specifically configured to:
inputting, for each domain corpus, the domain corpus into the first WFSA network for searching, so as to obtain a plurality of candidate paths corresponding to the domain corpus and the path probability of each candidate path;
and sorting the plurality of candidate paths corresponding to the domain corpus in descending order of path probability, and taking the top preset number of candidate paths as the candidate optimal paths of the domain corpus.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method for domain language model construction according to any one of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the domain language model construction method according to any one of claims 1 to 5.
CN202010669031.1A 2020-07-13 2020-07-13 Domain language model construction method, device, computer equipment and storage medium Active CN112002310B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010669031.1A CN112002310B (en) 2020-07-13 2020-07-13 Domain language model construction method, device, computer equipment and storage medium
PCT/CN2021/099661 WO2022012238A1 (en) 2020-07-13 2021-06-11 Method and apparatus for constructing domain language model, and computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010669031.1A CN112002310B (en) 2020-07-13 2020-07-13 Domain language model construction method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112002310A (en) 2020-11-27
CN112002310B (en) 2024-03-26

Family

ID=73466859

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010669031.1A Active CN112002310B (en) 2020-07-13 2020-07-13 Domain language model construction method, device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112002310B (en)
WO (1) WO2022012238A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614023A (en) * 2020-12-25 2021-04-06 东北大学 Formalized security verification method for electronic contract
CN113782001A (en) * 2021-11-12 2021-12-10 深圳市北科瑞声科技股份有限公司 Specific field voice recognition method and device, electronic equipment and storage medium
WO2022012238A1 (en) * 2020-07-13 2022-01-20 苏宁易购集团股份有限公司 Method and apparatus for constructing domain language model, and computer device, and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968989A (en) * 2012-12-10 2013-03-13 中国科学院自动化研究所 Improvement method of Ngram model for voice recognition
US20130311182A1 (en) * 2012-05-16 2013-11-21 Gwangju Institute Of Science And Technology Apparatus for correcting error in speech recognition
JP2015152661A (en) * 2014-02-12 2015-08-24 日本電信電話株式会社 Weighted finite state automaton creation device, symbol string conversion device, voice recognition device, methods thereof and programs
US20150254233A1 (en) * 2014-03-06 2015-09-10 Nice-Systems Ltd Text-based unsupervised learning of language models
US20150325235A1 (en) * 2014-05-07 2015-11-12 Microsoft Corporation Language Model Optimization For In-Domain Application
US20160093292A1 (en) * 2014-09-26 2016-03-31 Intel Corporation Optimizations to decoding of wfst models for automatic speech recognition
JP2017097451A (en) * 2015-11-18 2017-06-01 富士通株式会社 Information processing method, information processing program, and information processing device
CN110472223A (en) * 2018-05-10 2019-11-19 北京搜狗科技发展有限公司 A kind of input configuration method, device and electronic equipment
CN111243599A (en) * 2020-01-13 2020-06-05 网易有道信息技术(北京)有限公司 Speech recognition model construction method, device, medium and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112002310B (en) * 2020-07-13 2024-03-26 苏宁云计算有限公司 Domain language model construction method, device, computer equipment and storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130311182A1 (en) * 2012-05-16 2013-11-21 Gwangju Institute Of Science And Technology Apparatus for correcting error in speech recognition
CN102968989A (en) * 2012-12-10 2013-03-13 中国科学院自动化研究所 Improvement method of Ngram model for voice recognition
JP2015152661A (en) * 2014-02-12 2015-08-24 日本電信電話株式会社 Weighted finite state automaton creation device, symbol string conversion device, voice recognition device, methods thereof and programs
US20150254233A1 (en) * 2014-03-06 2015-09-10 Nice-Systems Ltd Text-based unsupervised learning of language models
US20150325235A1 (en) * 2014-05-07 2015-11-12 Microsoft Corporation Language Model Optimization For In-Domain Application
US20160093292A1 (en) * 2014-09-26 2016-03-31 Intel Corporation Optimizations to decoding of wfst models for automatic speech recognition
JP2017097451A (en) * 2015-11-18 2017-06-01 富士通株式会社 Information processing method, information processing program, and information processing device
CN110472223A (en) * 2018-05-10 2019-11-19 北京搜狗科技发展有限公司 A kind of input configuration method, device and electronic equipment
CN111243599A (en) * 2020-01-13 2020-06-05 网易有道信息技术(北京)有限公司 Speech recognition model construction method, device, medium and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
T. Hori et al.: "Language model adaptation using WFST-based speaking-style translation", IEEE
王晓瑞; 丁鹏; 梁家恩; 徐波: "A language model adaptation framework for broadcast speech recognition" (in Chinese), Journal of Chinese Information Processing, no. 04
范书平: "Research on a WFST-based Chinese speech recognition decoder" (in Chinese), China Master's Theses Full-text Database, Information Science and Technology Series, no. 03

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022012238A1 (en) * 2020-07-13 2022-01-20 苏宁易购集团股份有限公司 Method and apparatus for constructing domain language model, and computer device, and storage medium
CN112614023A (en) * 2020-12-25 2021-04-06 东北大学 Formalized security verification method for electronic contract
CN113782001A (en) * 2021-11-12 2021-12-10 深圳市北科瑞声科技股份有限公司 Specific field voice recognition method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2022012238A1 (en) 2022-01-20
CN112002310B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
CN112002310B (en) Domain language model construction method, device, computer equipment and storage medium
CN112102815B (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN111444311A (en) Semantic understanding model training method and device, computer equipment and storage medium
CN110704588A (en) Multi-round dialogue semantic analysis method and system based on long-term and short-term memory network
CN111563208A (en) Intention identification method and device and computer readable storage medium
US9934452B2 (en) Pruning and label selection in hidden Markov model-based OCR
CN110175273B (en) Text processing method and device, computer readable storage medium and computer equipment
CN111428474A (en) Language model-based error correction method, device, equipment and storage medium
CN111583911B (en) Speech recognition method, device, terminal and medium based on label smoothing
CN113506574A (en) Method and device for recognizing user-defined command words and computer equipment
CN106843523B (en) Character input method and device based on artificial intelligence
CN114120978A (en) Emotion recognition model training and voice interaction method, device, equipment and medium
CN109086348B (en) Hyperlink processing method and device and storage medium
CN112766319A (en) Dialogue intention recognition model training method and device, computer equipment and medium
CN112417878A (en) Entity relationship extraction method, system, electronic equipment and storage medium
CN112733911A (en) Entity recognition model training method, device, equipment and storage medium
CN112836506A (en) Information source coding and decoding method and device based on context semantics
CN114297361A (en) Human-computer interaction method based on scene conversation understanding and related components
JP2021524095A (en) Text-level text translation methods and equipment
CN113343711A (en) Work order generation method, device, equipment and storage medium
CN113449081A (en) Text feature extraction method and device, computer equipment and storage medium
CN115062619B (en) Chinese entity linking method, device, equipment and storage medium
CN115497484B (en) Voice decoding result processing method, device, equipment and storage medium
CN115862616A (en) Speech recognition method
CN115270789A (en) Abnormal voice data detection method and device and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant