CN112002310B - Domain language model construction method, device, computer equipment and storage medium

Domain language model construction method, device, computer equipment and storage medium

Info

Publication number
CN112002310B
CN112002310B (application CN202010669031.1A)
Authority
CN
China
Prior art keywords
wfsa
network
domain
language model
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010669031.1A
Other languages
Chinese (zh)
Other versions
CN112002310A (en)
Inventor
张旭华
齐欣
孙泽明
朱林林
王宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suning Cloud Computing Co Ltd
Original Assignee
Suning Cloud Computing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suning Cloud Computing Co Ltd filed Critical Suning Cloud Computing Co Ltd
Priority to CN202010669031.1A priority Critical patent/CN112002310B/en
Publication of CN112002310A publication Critical patent/CN112002310A/en
Priority to PCT/CN2021/099661 priority patent/WO2022012238A1/en
Application granted granted Critical
Publication of CN112002310B publication Critical patent/CN112002310B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering

Abstract

The invention discloses a method, an apparatus, a computer device and a storage medium for constructing a domain language model, belonging to the technical field of speech recognition. The method comprises the following steps: converting a general language model into an equivalent first WFSA network; screening optimal paths that satisfy preset conditions from the first WFSA network according to a preset number of domain corpora to construct a second WFSA network; and normalizing the second WFSA network and converting the normalized second WFSA network into a domain language model. With this method, a domain language model that fits a specific scenario and retains general generalization capability can be constructed quickly even when domain training corpora are insufficient.

Description

Domain language model construction method, device, computer equipment and storage medium
Technical Field
The present invention relates to the field of speech recognition technology, and in particular, to a method and apparatus for constructing a domain language model, a computer device, and a storage medium.
Background
Most speech recognition schemes are based on language models. The most commonly used model for training a language model is the N-Gram model, a statistical language model whose quality generally improves as the training corpus grows. As application scenarios become more specialized, language models that fit a specific scenario while retaining generalization capability are increasingly needed, which places higher demands on corpus selection.
At present, there are two common methods for constructing a language model that fits a specific scenario: one is to collect relevant domain corpora and train on them directly; the other is to fuse a language model trained on domain corpora with a general language model according to a certain weight to increase generalization capability. Both methods require a large amount of domain training corpora, yet domain corpora that fit the scenario are not easy to obtain.
Disclosure of Invention
In order to solve the problems in the prior art, the embodiments of the invention provide a method, an apparatus, a computer device and a storage medium for constructing a domain language model, which can quickly construct a domain language model that fits a specific scenario and retains general generalization capability even when domain training corpora are insufficient.
In a first aspect, a method for constructing a domain language model is provided, where the method includes:
converting the universal language model into an equivalent first WFSA network;
screening optimal paths meeting preset conditions from the first WFSA network according to a preset number of domain corpora to construct a second WFSA network;
normalizing the second WFSA network, and converting the normalized second WFSA network into a domain language model.
Further, the screening of optimal paths satisfying preset conditions from the first WFSA network according to the preset number of domain corpora to construct the second WFSA network includes:
searching a preset number of candidate optimal paths in the first WFSA network for each domain corpus; and
screening out, from the preset number of candidate optimal paths, the optimal paths corresponding to the domain corpus, wherein the probability on the outgoing arcs of every state node of each optimal path exceeds a preset threshold;
and constructing the second WFSA network according to the optimal path corresponding to each domain corpus.
Further, the searching of a preset number of candidate optimal paths in the first WFSA network for each domain corpus includes:
for each domain corpus, inputting the domain corpus into the first WFSA network for searching to obtain a plurality of candidate paths corresponding to the domain corpus and the path probabilities of the candidate paths;
and sorting the plurality of candidate paths corresponding to the domain corpus in descending order of path probability, and taking the top preset number of candidate paths as the candidate optimal paths of the domain corpus.
Further, the normalizing the second WFSA network includes:
and normalizing the probabilities on all the outgoing arcs of each state node in the second WFSA network according to the number of outgoing arcs of each state node in the second WFSA network and the probability on each outgoing arc.
Further, the general language model and the domain language model are both N-Gram language models.
In a second aspect, there is provided a domain language model construction apparatus, the apparatus comprising:
the first conversion module is used for converting the universal language model into an equivalent first WFSA network;
the construction module is used for screening out an optimal path meeting preset conditions from the first WFSA network according to the preset number of domain corpora so as to construct a second WFSA network;
the normalization module is used for normalizing the second WFSA network;
and the second conversion module is used for converting the normalized second WFSA network into a domain language model.
Further, the construction module includes:
the searching sub-module is used for searching a preset number of candidate optimal paths in the first WFSA network for each domain corpus;
the screening sub-module is used for screening out, from the preset number of candidate optimal paths, the optimal paths corresponding to the domain corpus, wherein the probability on the outgoing arc of every state node of each optimal path exceeds a preset threshold;
and the construction submodule is used for constructing the second WFSA network according to the optimal path corresponding to each domain corpus.
Further, the searching submodule is specifically configured to:
for each domain corpus, inputting the domain corpus into the first WFSA network for searching to obtain a plurality of candidate paths corresponding to the domain corpus and the path probabilities of the candidate paths;
and sorting the plurality of candidate paths corresponding to the domain corpus in descending order of path probability, and taking the top preset number of candidate paths as the candidate optimal paths of the domain corpus.
Further, the normalization module is specifically configured to:
and normalizing the probabilities on all the outgoing arcs of each state node in the second WFSA network according to the number of outgoing arcs of each state node in the second WFSA network and the probability on each outgoing arc.
Further, the general language model and the domain language model are both N-Gram language models.
In a third aspect, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
converting the universal language model into an equivalent first WFSA network;
screening optimal paths meeting preset conditions from the first WFSA network according to a preset number of domain corpora to construct a second WFSA network;
normalizing the second WFSA network, and converting the normalized second WFSA network into a domain language model.
In a fourth aspect, there is provided a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the following steps:
converting the universal language model into an equivalent first WFSA network;
screening optimal paths meeting preset conditions from the first WFSA network according to a preset number of domain corpora to construct a second WFSA network;
normalizing the second WFSA network, and converting the normalized second WFSA network into a domain language model.
The invention provides a method, an apparatus, a computer device and a storage medium for constructing a domain language model. A general language model is first converted into an equivalent first WFSA network; optimal paths satisfying preset conditions are then screened from the first WFSA network according to a preset number of domain corpora to construct a second WFSA network; finally, the second WFSA network is normalized and the normalized second WFSA network is converted into a domain language model. Because the paths used to construct the second WFSA network are screened, for the preset number of domain corpora, from the first WFSA network converted from the general language model, a domain language model that fits a specific scenario and retains general generalization capability can be constructed quickly even when domain training corpora are insufficient.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
FIG. 1 shows a flowchart of a method for building a domain language model according to an embodiment of the present invention;
FIG. 2 is a specific flowchart of step S2 shown in FIG. 1;
FIG. 3 is a diagram showing a construction apparatus for a domain language model according to an embodiment of the present invention;
fig. 4 shows an internal structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It is noted that, unless the context clearly requires otherwise, the words "comprise," "comprising," and the like throughout the specification and the claims should be construed in an inclusive sense rather than an exclusive or exhaustive sense, that is, in the sense of "including but not limited to". Furthermore, in the description of the present invention, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, unless otherwise indicated, "a plurality" means two or more.
As described in the background above, there are two common methods for constructing a language model that fits a specific scenario: one is to collect relevant domain corpora and train on them directly, and the other is to fuse a language model trained on domain corpora with a general language model according to a certain weight to increase generalization capability. Both methods require a large amount of training corpora, yet corpora that fit the scenario are not easy to obtain. The domain language model in the embodiments of the present invention may be applied to a specific domain, where the specific domain may be the financial domain, the medical domain, the commodity domain, the logistics domain, or another specific domain, which is not specifically limited by the present invention.
Fig. 1 shows a flowchart of a domain language model construction method according to an embodiment of the present invention. In this embodiment, the execution body is a domain language model construction apparatus, which may be configured in any computer device; the computer device may be an independent server or a server cluster.
Referring to fig. 1, the method for constructing a domain language model provided by the present invention includes steps S1 to S4:
s1: the generic language model is converted to an equivalent first WFSA network.
The general language model may be a statistical language model, i.e., a probability distribution over sequences of words: for a given sequence of length m it produces a probability P(w1, w2, ..., wm). In essence, it seeks a probability distribution that can represent the probability of any sentence or sequence occurring, typically using conditional probabilities under the assumption that the current word depends only on the n words that precede it. N-Gram is an algorithm based on such a statistical language model and on the Markov assumption, namely: in a piece of text, the occurrence of the N-th word depends only on the preceding N-1 words and on no other words. Under this assumption, the probability of each word occurring in the text can be evaluated, and the probability of an entire sentence is the product of the probabilities of the individual words. These probabilities can be obtained by counting the number of co-occurrences of N words directly in the corpus; commonly used N-Gram models include the binary Bi-Gram and the ternary Tri-Gram.
The general language model can be generated in advance by training on a general corpus; the general corpus can be obtained by crawling Chinese text from the Internet with a web crawler or by directly downloading publicly available free Chinese corpora, and the storage format of the general language model can be the arpa format. It should be noted that training and updating the general language model is time-consuming, so it is generally done only once, with the aim of covering language phenomena as comprehensively as possible. The reason for using such a broad-coverage general language model rather than another domain model is that the general language model does not focus on any particular domain: it is a relatively smooth set of probabilities computed on a large amount of historical text, so it migrates more easily to the target domain and can reflect word-connection probabilities that are close to reality.
The first WFSA network is a directed graph structure with a plurality of state nodes; the state nodes are connected by arcs, each arc represents a transition between states, the arcs are directed, and each arc carries an input label and the probability of the corresponding state transition. The input label is a word; the probability on an arc characterizes the probability that the arc appears in a path. The first WFSA network may contain many paths, and the probability of each path can be computed as the product of the probabilities of all arcs on the path. When the probabilities are represented as weights on the arcs between state nodes, the weight value may be obtained by taking the logarithm of the probability.
Specifically, when converting the general language model in the arpa format into the first WFSA (Weighted Finite-State Automaton) network, the execution body may call the arpa2fst tool to convert the general language model into an equivalent first WFSA network. Of course, in practical applications, besides calling the arpa2fst tool, the equivalent first WFSA network may be obtained through other conversion methods, which is not limited in this embodiment.
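As a rough illustration of the bigram (Bi-Gram) case described above, the following Python sketch estimates conditional word probabilities from co-occurrence counts and scores a sentence as the product of those probabilities; the toy corpus, names and the absence of smoothing are the editor's simplifications, not part of the patent.

```python
from collections import Counter

# Toy corpus (hypothetical example); in practice the counts come from a large general corpus.
corpus = [
    ["<s>", "the", "weather", "is", "good", "</s>"],
    ["<s>", "the", "weather", "is", "bad", "</s>"],
    ["<s>", "today", "the", "weather", "is", "good", "</s>"],
]

# Count contexts (every token except the sentence-final one) and adjacent word pairs.
context_counts = Counter(w for sent in corpus for w in sent[:-1])
bigram_counts = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev); smoothing omitted for brevity."""
    return bigram_counts[(prev, word)] / context_counts[prev] if context_counts[prev] else 0.0

def sentence_prob(words):
    """Sentence probability as the product of its bigram probabilities."""
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(sentence_prob(["<s>", "the", "weather", "is", "good", "</s>"]))  # ~0.44 on this toy corpus
```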
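For concreteness, here is a minimal sketch of such a network as a directed graph whose arcs carry a word label and a transition probability, with the path probability computed as the product of arc probabilities; the class and field names are illustrative assumptions, not a data format defined by the patent.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Arc:
    label: str       # input label: a word
    prob: float      # probability of this state transition
    next_state: int  # destination state node

@dataclass
class WFSA:
    start: int = 0
    finals: set = field(default_factory=set)   # end state nodes
    arcs: dict = field(default_factory=dict)   # state node -> list of outgoing Arcs

    def add_arc(self, src, label, prob, dst):
        self.arcs.setdefault(src, []).append(Arc(label, prob, dst))

def path_prob(path_arcs):
    """Path probability = product of arc probabilities (equivalently, a sum of log weights)."""
    return math.exp(sum(math.log(arc.prob) for arc in path_arcs))

# Tiny example path: state 0 --"today"--> 1 --"weather"--> 2 --"good"--> 3 (final)
g = WFSA(finals={3})
g.add_arc(0, "today", 0.3, 1)
g.add_arc(1, "weather", 0.5, 2)
g.add_arc(2, "good", 0.4, 3)
```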
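As one possible way to carry out this step, assuming Kaldi's arpa2fst tool is installed and on PATH (the patent does not fix a particular toolchain, and additional flags may be needed depending on the setup), the conversion could be invoked from Python as follows:

```python
import subprocess

def arpa_to_wfsa(arpa_path: str, fst_path: str) -> None:
    """Convert an ARPA-format general language model into an equivalent FST/WFSA file."""
    # Basic positional usage of arpa2fst: input ARPA file, output FST file.
    subprocess.run(["arpa2fst", arpa_path, fst_path], check=True)

arpa_to_wfsa("general_lm.arpa", "general_lm.fst")
```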
S2: and screening the optimal paths meeting the preset conditions from the first WFSA network according to the preset number of domain corpora to construct a second WFSA network.
The domain corpus can be common words and sentences, professional words and sentences and the like in a specific domain.
The preset number of domain corpora may be set relatively small; it can be understood that the number of samples in the preset number of domain corpora is much smaller than that of the general corpus.
In this embodiment, multiple paths may be searched in the first WFSA network for each domain corpus, and one or more optimal paths satisfying the preset conditions may be obtained for each domain corpus through screening. For example, the optimal path may be the path with the highest path probability, and the word sequence corresponding to each domain corpus may then be obtained from its optimal path.
The preset condition is a condition set in advance for determining the optimal path. In a specific application, the preset condition may be set as follows: when the probability on the outgoing arc of every state node on a path exceeds a preset threshold, the path is an optimal path. Alternatively, the preset condition may be set as: when the sum of the probabilities on all the outgoing arcs of a path exceeds a preset threshold, the path is an optimal path.
Specifically, as shown in fig. 2, the implementation procedure of step S2 may include the steps of:
s21: aiming at each domain corpus, searching a preset number of candidate optimal paths corresponding to the domain corpus in the first WFSA network.
The preset number may be set to an integer value according to actual needs, and the specific preset number is not limited in this embodiment.
Specifically, the process may include:
for each domain corpus, inputting the domain corpus into the first WFSA network for searching to obtain a plurality of candidate paths corresponding to the domain corpus and the path probabilities of the candidate paths; and sorting the plurality of candidate paths corresponding to the domain corpus in descending order of path probability, and taking the top preset number of candidate paths as the candidate optimal paths of the domain corpus.
By way of example, assuming that the domain corpus "today the weather is good" is input into the first WFSA network, the following two candidate optimal paths may be found:
PATH1: <s> today the weather is good
PATH2: <s> today the weather is good
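A simplified sketch of this top-k selection, reusing the toy WFSA class and path_prob function from the earlier sketch and ignoring epsilon/backoff arcs (a real first WFSA network derived from an N-Gram model would contain them), might look like this:

```python
def accepting_paths(wfsa, words):
    """Enumerate all paths through the WFSA whose arc labels spell out the given word sequence."""
    paths = []

    def walk(state, idx, arcs_so_far):
        if idx == len(words):
            if state in wfsa.finals:
                paths.append(arcs_so_far)
            return
        for arc in wfsa.arcs.get(state, []):
            if arc.label == words[idx]:
                walk(arc.next_state, idx + 1, arcs_so_far + [arc])

    walk(wfsa.start, 0, [])
    return paths

def top_k_candidate_paths(wfsa, words, k):
    """Sort candidate paths by path probability in descending order and keep the top k."""
    scored = [(path_prob(p), p) for p in accepting_paths(wfsa, words)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]
```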
S22: screen out, from the preset number of candidate optimal paths, the optimal path corresponding to the domain corpus, where the probability on the outgoing arc of every state node of the optimal path exceeds a preset threshold.
In this embodiment, after the one or more candidate optimal paths corresponding to a given domain corpus are found, a candidate optimal path is taken as the optimal path corresponding to the domain corpus when the probability on the outgoing arc of every state node on that candidate path exceeds the preset threshold.
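Continuing the same sketch, the per-arc check described above could be expressed as follows (the threshold value is purely illustrative):

```python
def passes_arc_threshold(path_arcs, threshold=0.01):
    """A candidate path qualifies only if the probability on every one of its arcs exceeds the threshold."""
    return all(arc.prob > threshold for arc in path_arcs)

def select_optimal_paths(scored_candidates, threshold=0.01):
    """Keep the candidate optimal paths on which every outgoing-arc probability clears the threshold."""
    return [path for prob, path in scored_candidates if passes_arc_threshold(path, threshold)]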
S23: construct the second WFSA network according to the optimal path corresponding to each domain corpus.
Specifically, a second WFSA network containing only an initial state node and an end state node may be constructed in advance. Each time the optimal path corresponding to a domain corpus is obtained, that path is added into the second WFSA network, until the optimal path corresponding to the last domain corpus has been added, at which point the construction of the second WFSA network is complete.
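One way to picture this incremental construction, again with the toy WFSA class (state sharing between paths is omitted for brevity; the patent does not prescribe a particular data structure):

```python
def build_second_wfsa(optimal_paths):
    """Start from a network with only an initial state (0) and an end state (1),
    then add the arcs of each selected optimal path, one path at a time."""
    second = WFSA(start=0, finals={1})
    next_state = 2
    for path_arcs in optimal_paths:
        state = second.start
        for i, arc in enumerate(path_arcs):
            last = (i == len(path_arcs) - 1)
            dst = 1 if last else next_state
            second.add_arc(state, arc.label, arc.prob, dst)
            if not last:
                state = dst
                next_state += 1
    return second
```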
S3: the second WFSA network is normalized.
Specifically, according to the number of outgoing arcs of each state node in the second WFSA network and the probability on each outgoing arc, the probabilities on all outgoing arcs of each state node in the second WFSA network are normalized so that, for every state node, the probabilities on all of its outgoing arcs sum to 1.
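A minimal sketch of this normalization over the same toy structure (the only requirement stated here is that each state node's outgoing probabilities sum to 1):

```python
def normalize(wfsa):
    """Rescale the probabilities on every state node's outgoing arcs so that they sum to 1."""
    for state, out_arcs in wfsa.arcs.items():
        total = sum(arc.prob for arc in out_arcs)
        if total > 0:
            for arc in out_arcs:
                arc.prob /= total
    return wfsa
```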
S4: and converting the normalized second WFSA network into a domain language model.
When the general language model is an N-Gram model, the domain language model is an N-Gram model of the same order as the general language model.
Specifically, the execution body may call the fsts-to-transgressions tool to convert the normalized second WFSA network into an N-Gram model in the arpa format, thereby obtaining the domain language model. In addition, besides calling this tool, the domain language model may be obtained through other conversion methods, which is not limited in this embodiment.
The invention provides a domain language model construction method in which a general language model is converted into an equivalent first WFSA network; optimal paths satisfying preset conditions are then screened from the first WFSA network according to a preset number of domain corpora to construct a second WFSA network; finally, the second WFSA network is normalized and the normalized second WFSA network is converted into a domain language model. Because the paths used to construct the second WFSA network are screened, for the preset number of domain corpora, from the first WFSA network converted from the general language model, a language model that fits a specific scenario and retains general generalization capability can be constructed quickly even when training corpora are insufficient.
Fig. 3 shows a block diagram of a domain language model construction device according to an embodiment of the present invention, and referring to fig. 3, the device includes:
a first conversion module 31, configured to convert the generic language model into an equivalent first WFSA network;
a construction module 32, configured to screen out an optimal path satisfying a preset condition from the first WFSA network according to a preset number of domain corpora, so as to construct a second WFSA network;
a normalization module 33, configured to normalize the second WFSA network;
and the second conversion module 34 is configured to convert the normalized second WFSA network into a domain language model.
In one embodiment, the construction module 32 includes:
a searching sub-module 321, configured to search, for each domain corpus, a preset number of candidate optimal paths corresponding to the domain corpus in the first WFSA network;
the screening sub-module 322 is configured to screen out, from the preset number of candidate optimal paths, the optimal path corresponding to the domain corpus, where the probability on the outgoing arc of every state node of the optimal path exceeds a preset threshold;
and a construction sub-module 323, configured to construct the second WFSA network according to the optimal path corresponding to each domain corpus.
In one embodiment, the search sub-module 321 is specifically configured to:
for each domain corpus, inputting the domain corpus into the first WFSA network for searching to obtain a plurality of candidate paths corresponding to the domain corpus and the path probabilities of the candidate paths;
and sorting the plurality of candidate paths corresponding to the domain corpus in descending order of path probability, and taking the top preset number of candidate paths as the candidate optimal paths of the domain corpus.
In one embodiment, the normalization module 33 is specifically configured to:
and normalizing the probabilities on all the outgoing arcs of each state node in the second WFSA network according to the number of outgoing arcs of each state node in the second WFSA network and the probability on each outgoing arc, so that for every state node in the second WFSA network the probabilities on all of its outgoing arcs sum to 1.
In one embodiment, the generic language model and the domain language model are both N-Gram language models.
The domain language model construction apparatus provided by this embodiment belongs to the same inventive concept as the domain language model construction method provided by the embodiments of the present invention; it can execute the domain language model construction method provided by any embodiment of the present invention and has the functional modules and beneficial effects corresponding to executing that method. For technical details not described in detail in this embodiment, reference may be made to the domain language model construction method provided by the embodiments of the present invention, which is not repeated here.
Fig. 4 shows an internal structural diagram of a computer device according to an embodiment of the present invention. The computer device may be a server, the internal structure of which may be as shown in fig. 4. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a domain language model building method.
In one embodiment, there is provided a computer device comprising:
one or more processors;
a storage means for storing one or more programs;
the following steps are implemented when one or more programs are executed by one or more processors, causing the one or more processors to execute the computer program:
converting the universal language model into an equivalent first WFSA network;
screening optimal paths meeting preset conditions from the first WFSA network according to the preset number of domain corpora to construct a second WFSA network;
normalizing the second WFSA network, and converting the normalized second WFSA network into a domain language model.
In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon which, when executed by a processor, implements the following steps:
converting the universal language model into an equivalent first WFSA network;
screening optimal paths meeting preset conditions from the first WFSA network according to the preset number of domain corpora to construct a second WFSA network;
normalizing the second WFSA network, and converting the normalized second WFSA network into a domain language model.
Those skilled in the art will appreciate that all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing embodiments represent only a few implementations of the present application and are described in relative detail, but they are not to be construed as limiting the scope of the invention. It should be noted that various modifications and improvements can be made by those of ordinary skill in the art without departing from the concept of the present application, and these all fall within the protection scope of the present application. Accordingly, the protection scope of the present application shall be determined by the appended claims.

Claims (8)

1. A method for building a domain language model, the method comprising:
converting the universal language model into an equivalent first WFSA network;
screening optimal paths meeting preset conditions from the first WFSA network according to a preset number of domain corpora to construct a second WFSA network;
normalizing the second WFSA network, and converting the normalized second WFSA network into a domain language model;
wherein the screening of optimal paths satisfying preset conditions from the first WFSA network according to the preset number of domain corpora to construct the second WFSA network includes:
searching a preset number of candidate optimal paths in the first WFSA network for each domain corpus; and
screening out, from the preset number of candidate optimal paths, the optimal paths corresponding to the domain corpus, wherein the probability on the outgoing arcs of every state node of each optimal path exceeds a preset threshold;
and constructing the second WFSA network according to the optimal path corresponding to each domain corpus.
2. The method of claim 1, wherein the searching a preset number of candidate optimal paths in the first WFSA network for each of the domain corpora comprises:
inputting, for each domain corpus, the domain corpus into the first WFSA network for searching to obtain a plurality of candidate paths corresponding to the domain corpus and the path probabilities of the candidate paths;
and sorting the plurality of candidate paths corresponding to the domain corpus in descending order of path probability, and taking the top preset number of candidate paths as the candidate optimal paths of the domain corpus.
3. The method of claim 1 or 2, wherein normalizing the second WFSA network comprises:
and normalizing the probabilities on all the outgoing arcs of each state node in the second WFSA network according to the number of outgoing arcs of each state node in the second WFSA network and the probability on each outgoing arc.
4. The method of claim 1, wherein the generic language model and the domain language model are both N-Gram language models.
5. A domain language model construction apparatus, the apparatus comprising:
the first conversion module is used for converting the universal language model into an equivalent first WFSA network;
the construction module is used for screening out an optimal path meeting preset conditions from the first WFSA network according to the preset number of domain corpora so as to construct a second WFSA network;
the normalization module is used for normalizing the second WFSA network;
the second conversion module is used for converting the normalized second WFSA network into a domain language model;
wherein the construction module comprises:
the searching sub-module is used for searching a preset number of candidate optimal paths in the first WFSA network for each domain corpus;
the screening sub-module is used for screening out, from the preset number of candidate optimal paths, the optimal paths corresponding to the domain corpus, wherein the probability on the outgoing arc of every state node of each optimal path exceeds a preset threshold;
and the construction submodule is used for constructing the second WFSA network according to the optimal path corresponding to each domain corpus.
6. The apparatus of claim 5, wherein the search submodule is specifically configured to:
inputting, for each domain corpus, the domain corpus into the first WFSA network for searching to obtain a plurality of candidate paths corresponding to the domain corpus and the path probabilities of the candidate paths;
and sorting the plurality of candidate paths corresponding to the domain corpus in descending order of path probability, and taking the top preset number of candidate paths as the candidate optimal paths of the domain corpus.
7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the domain language model construction method of any one of claims 1 to 4 when the computer program is executed.
8. A computer-readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the domain language model construction method of any one of claims 1 to 4.
CN202010669031.1A 2020-07-13 2020-07-13 Domain language model construction method, device, computer equipment and storage medium Active CN112002310B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010669031.1A CN112002310B (en) 2020-07-13 2020-07-13 Domain language model construction method, device, computer equipment and storage medium
PCT/CN2021/099661 WO2022012238A1 (en) 2020-07-13 2021-06-11 Method and apparatus for constructing domain language model, and computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010669031.1A CN112002310B (en) 2020-07-13 2020-07-13 Domain language model construction method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112002310A CN112002310A (en) 2020-11-27
CN112002310B true CN112002310B (en) 2024-03-26

Family

ID=73466859

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010669031.1A Active CN112002310B (en) 2020-07-13 2020-07-13 Domain language model construction method, device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112002310B (en)
WO (1) WO2022012238A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112002310B (en) * 2020-07-13 2024-03-26 苏宁云计算有限公司 Domain language model construction method, device, computer equipment and storage medium
CN112614023A (en) * 2020-12-25 2021-04-06 东北大学 Formalized security verification method for electronic contract
CN113782001B (en) * 2021-11-12 2022-03-08 深圳市北科瑞声科技股份有限公司 Specific field voice recognition method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968989A (en) * 2012-12-10 2013-03-13 中国科学院自动化研究所 Improvement method of Ngram model for voice recognition
JP2015152661A (en) * 2014-02-12 2015-08-24 日本電信電話株式会社 Weighted finite state automaton creation device, symbol string conversion device, voice recognition device, methods thereof and programs
JP2017097451A (en) * 2015-11-18 2017-06-01 富士通株式会社 Information processing method, information processing program, and information processing device
CN110472223A (en) * 2018-05-10 2019-11-19 北京搜狗科技发展有限公司 A kind of input configuration method, device and electronic equipment
CN111243599A (en) * 2020-01-13 2020-06-05 网易有道信息技术(北京)有限公司 Speech recognition model construction method, device, medium and electronic equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101394253B1 (en) * 2012-05-16 2014-05-13 광주과학기술원 Apparatus for correcting error of speech recognition
US20150254233A1 (en) * 2014-03-06 2015-09-10 Nice-Systems Ltd Text-based unsupervised learning of language models
US9972311B2 (en) * 2014-05-07 2018-05-15 Microsoft Technology Licensing, Llc Language model optimization for in-domain application
US9672810B2 (en) * 2014-09-26 2017-06-06 Intel Corporation Optimizations to decoding of WFST models for automatic speech recognition
CN112002310B (en) * 2020-07-13 2024-03-26 苏宁云计算有限公司 Domain language model construction method, device, computer equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968989A (en) * 2012-12-10 2013-03-13 中国科学院自动化研究所 Improvement method of Ngram model for voice recognition
JP2015152661A (en) * 2014-02-12 2015-08-24 日本電信電話株式会社 Weighted finite state automaton creation device, symbol string conversion device, voice recognition device, methods thereof and programs
JP2017097451A (en) * 2015-11-18 2017-06-01 富士通株式会社 Information processing method, information processing program, and information processing device
CN110472223A (en) * 2018-05-10 2019-11-19 北京搜狗科技发展有限公司 A kind of input configuration method, device and electronic equipment
CN111243599A (en) * 2020-01-13 2020-06-05 网易有道信息技术(北京)有限公司 Speech recognition model construction method, device, medium and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Language model adaptation using WFST-based speaking-style translation; T. Hori et al.; IEEE; full text *
A language model adaptation framework for broadcast speech recognition; 王晓瑞, 丁鹏, 梁家恩, 徐波; Journal of Chinese Information Processing (Issue 04); full text *
Research on a WFST-based Chinese speech recognition decoder; 范书平; China Masters' Theses Full-text Database, Information Science and Technology (Issue 03); full text *

Also Published As

Publication number Publication date
WO2022012238A1 (en) 2022-01-20
CN112002310A (en) 2020-11-27

Similar Documents

Publication Publication Date Title
CN112002310B (en) Domain language model construction method, device, computer equipment and storage medium
CN111444311A (en) Semantic understanding model training method and device, computer equipment and storage medium
CN110704588A (en) Multi-round dialogue semantic analysis method and system based on long-term and short-term memory network
CN109902301B (en) Deep neural network-based relationship reasoning method, device and equipment
CN110689881B (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN111583911B (en) Speech recognition method, device, terminal and medium based on label smoothing
CN110175273B (en) Text processing method and device, computer readable storage medium and computer equipment
CN112199473A (en) Multi-turn dialogue method and device in knowledge question-answering system
CN106843523B (en) Character input method and device based on artificial intelligence
CN109086348B (en) Hyperlink processing method and device and storage medium
CN111462751A (en) Method, apparatus, computer device and storage medium for decoding voice data
CN112733911A (en) Entity recognition model training method, device, equipment and storage medium
CN112687266A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
JP2021524095A (en) Text-level text translation methods and equipment
CN113779214A (en) Automatic generation method and device of jump condition, computer equipment and storage medium
CN113343711A (en) Work order generation method, device, equipment and storage medium
CN115062619B (en) Chinese entity linking method, device, equipment and storage medium
CN115497484B (en) Voice decoding result processing method, device, equipment and storage medium
CN115391512A (en) Training method, device, equipment and storage medium of dialogue language model
CN113704466B (en) Text multi-label classification method and device based on iterative network and electronic equipment
CN111783435A (en) Shared vocabulary selection method and device and storage medium
CN112735392B (en) Voice processing method, device, equipment and storage medium
CN115862616A (en) Speech recognition method
CN113571052A (en) Noise extraction and instruction identification method and electronic equipment
CN112487811B (en) Cascading information extraction system and method based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant