WO2021051514A1 - Speech identification method and apparatus, computer device and non-volatile storage medium - Google Patents

Speech identification method and apparatus, computer device and non-volatile storage medium

Info

Publication number
WO2021051514A1
Authority
WO
WIPO (PCT)
Prior art keywords
word graph
path
search result
model
word
Application number
PCT/CN2019/116920
Other languages
French (fr)
Chinese (zh)
Inventor
李秀丰
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2021051514A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 2015/081 Search algorithms, e.g. Baum-Welch or Viterbi

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a voice recognition method, device, computer equipment and non-volatile storage medium.
  • N-Gram is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese it is called the Chinese Language Model (CLM).
  • The Chinese language model uses the collocation information between adjacent words in context to realize automatic conversion to Chinese characters.
  • When consecutive, unspaced pinyin, strokes, or digits representing letters or strokes need to be converted into a Chinese character string (i.e. a sentence), the Chinese language model can compute the sentence with the greatest probability, so the conversion to Chinese characters is automatic, requires no manual selection by the user, and avoids the ambiguity of many Chinese characters corresponding to the same pinyin (or stroke string or digit string).
  • the model is based on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and on no other words; the probability of the entire sentence is then the product of the probabilities of its words, i.e. an N-gram context.
  • at present, most mainstream speech recognition decoders use a decoding network based on finite state transducers (WFST), which integrates the language model, the dictionary, and the acoustic shared phonetic character set into one large decoding network; during path search, the acoustic decoding dimension must also be searched, so the search volume is large.
  • the purpose of the embodiments of the present application is to provide a speech recognition method, which can reduce the search dimension of the decoding network, increase the search speed of the decoding network, and thereby increase the speed of speech recognition.
  • the embodiments of the present application provide a voice recognition method, which adopts the following technical solutions:
  • the second search result includes the second path and the corresponding second path score.
  • the second word graph model includes a second word graph space, and the first word graph space is a sub word graph space of the second word graph space;
  • the corresponding second path is selected and output according to the second path score in the second search result, and the speech recognition result is obtained.
  • an embodiment of the present application further provides a voice recognition device, including:
  • the acquiring module is used to acquire the voice information to be recognized
  • the first search module is used to input the to-be-recognized speech information into the local first word graph model for decoding search, and obtain the first search result.
  • the first search result includes the first path and the corresponding first path score.
  • the first word graph model includes an acoustic model, a pronunciation dictionary, and a first word graph space;
  • the second search module is used to input the first search result into the local second word graph model for searching, and obtain the second search result.
  • the second search result includes the second path and the corresponding second path score, where the second word graph model includes a second word graph space, and the first word graph space is a sub-word graph space of the second word graph space;
  • the output module is used to select the corresponding second path for output according to the second path score to obtain the voice recognition result.
  • the embodiments of the present application also provide a computer device, which adopts the following technical solutions:
  • the computer device includes a non-volatile memory and a processor, and executable code is stored in the non-volatile memory.
  • when the processor executes the executable code, the processor implements the steps of the voice recognition method described in any one of the embodiments of the present application.
  • the embodiments of the present application also provide a computer-readable non-volatile storage medium, which adopts the following technical solutions:
  • Executable code is stored on the computer-readable non-volatile storage medium, and when the executable code is executed by a processor, the steps of a speech recognition method according to any one of the embodiments of the present application are implemented.
  • this application inputs the voice information to be recognized into a small word graph model for acoustic decoding and search, and then directly inputs the search results into a larger word graph model for search.
  • the second search process does not need to perform acoustic decoding, so the search dimensionality is reduced, effectively lowering the amount of word graph search, thereby reducing the search time and improving the speed of speech recognition.
  • FIG. 1 is a schematic diagram of an exemplary system architecture to which the present application can be applied;
  • FIG. 2 is a schematic flowchart of a speech recognition method of the present application;
  • FIG. 3 is a schematic flowchart of another speech recognition method of the present application;
  • FIG. 4 is a schematic flowchart of step 202 in the embodiment of FIG. 2 of the present application;
  • FIG. 5 is a schematic flowchart of the construction of a first word graph model of the present application;
  • FIG. 6 is a schematic flowchart of the construction of another first word graph model of the present application;
  • FIG. 7 is a schematic flowchart of step 203 in the embodiment of FIG. 2 of the present application;
  • FIG. 8 is a schematic flowchart of step 204 in the embodiment of FIG. 2 of the present application;
  • FIG. 9 is a schematic structural diagram of a speech recognition device of the present application;
  • FIG. 10 is a schematic structural diagram of another speech recognition device of the present application;
  • FIG. 11 is a schematic diagram of a specific structure of the first search module 902;
  • FIG. 12 is a schematic structural diagram of another speech recognition device of the present application;
  • FIG. 13 is a schematic diagram of the specific structure of the first word graph model construction module 907;
  • FIG. 14 is a schematic diagram of a specific structure of the second search module 903;
  • FIG. 15 is a schematic diagram of a specific structure of the output module 904;
  • FIG. 16 is a block diagram of the basic structure of a computer device of the present application.
  • the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
  • the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
  • the user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and so on.
  • Various communication client applications such as web browser applications, shopping applications, search applications, instant messaging tools, mailbox clients, social platform software, etc., may be installed on the terminal devices 101, 102, and 103.
  • the terminal devices 101, 102, 103 may be various electronic devices with display screens that support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and so on.
  • the server 105 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal devices 101, 102, and 103.
  • a voice recognition method provided in the embodiments of the present application is generally executed by a terminal device.
  • a voice recognition device is generally provided in the terminal device.
  • terminal devices, networks, and servers in FIG. 1 are only illustrative, and any number of terminal devices, networks, and servers may be provided according to implementation needs.
  • FIG. 2 a flowchart of an embodiment of a voice recognition method according to the present application is shown.
  • the above-mentioned speech recognition method includes the following steps:
  • Step 201 Acquire voice information to be recognized.
  • an electronic device (such as the terminal device shown in FIG. 1) on which a voice recognition method runs can obtain the voice information to be recognized through a wired connection or a wireless connection.
  • the above wireless connection methods can include but are not limited to 3G/4G connection, WiFi (Wireless-Fidelity) connection, Bluetooth connection, WiMAX (Worldwide Interoperability for Microwave Access) connection, Zigbee connection, UWB (ultra wideband) connection, And other wireless connection methods that are currently known or developed in the future.
  • the aforementioned voice information to be recognized can be collected through a microphone.
  • the microphone can be an external device or built into the device, for example the microphone in a voice recorder, mobile phone, tablet, MP4 player, or notebook.
  • the aforementioned voice information to be recognized may also be obtained by uploading by the user, for example, storing the collected voice in a storage device, and obtaining the corresponding voice information by reading data in the storage device.
  • the aforementioned voice information to be recognized may also be the voice information of the other party obtained when the user communicates through social software.
  • the voice information to be recognized may also be voice information that has undergone domain conversion, for example, voice information that has been converted from the time domain into the frequency domain.
  • the above-mentioned voice information may also be referred to as voice signal or voice data.
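  • As an illustration only (not part of the application): the time-domain to frequency-domain conversion above can be realized with a conventional short-time Fourier transform. A minimal sketch, assuming 25 ms frames with a 10 ms hop at a 16 kHz sampling rate, all of which are illustrative parameter choices:

```python
import numpy as np

def to_frequency_domain(samples, frame_len=400, hop=160):
    """Split a 1-D waveform into overlapping frames and take the magnitude
    spectrum of each frame (400/160 samples = 25 ms/10 ms at 16 kHz)."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))  # one spectral slice per frame
    return np.array(frames)

# Example: one second of audio at 16 kHz yields 98 frames of 201 frequency bins.
spectra = to_frequency_domain(np.zeros(16000))
print(spectra.shape)  # (98, 201)
```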
  • Step 202 Input the to-be-recognized speech information into a local first word graph model for decoding search, and obtain a first search result.
  • the first search result includes the first path and the corresponding first path score.
  • the first word graph model includes an acoustic model, a pronunciation dictionary, and a first word graph space.
  • the aforementioned local can be an offline environment under the Linux system, and offline speech tools in other scenarios can also be configured in the offline environment.
  • the speech information to be recognized is the speech information obtained in step 201, and the first word graph model is a local word graph model; because the first word graph model is configured locally, the speech information can be decoded without going through the network, which improves the speed of speech recognition.
  • the first word graph model can be a word graph model based on WFST, and includes an acoustic model, a pronunciation dictionary, and a first word graph space.
  • the acoustic model acoustically decodes the user's speech information into phoneme units, and the pronunciation dictionary combines the phoneme units into phoneme words.
  • in the first word graph space, the phoneme words are connected into paths, forming language units.
  • the first word graph model is used to decode and search the speech information to be recognized.
  • the first search result is the search result obtained in the first word graph space.
  • the first search result includes multiple first paths, and each path includes a corresponding path score.
  • the path score is used to indicate the credibility of the path, the higher the score, the more credible the path.
  • a path is the sequence of connections between phoneme words together with the connection weights. For example, for the path 今天 (weight 0.9) 天气 (weight 0.8) 怎么样 (weight 0.9), the path score is the product of all the weights, 0.9*0.8*0.9 = 0.648; for 近天 (weight 0.3) 天气 (weight 0.2) 怎么样 (weight 0.8), the path score is 0.3*0.2*0.8 = 0.048.
  • the weights are obtained by training the first word graph model; the training corpus can be a corpus publicly available on the Internet, such as the complete People's Daily training corpus from 2000 to 2012.
  • Step 203 Input the first search result into the local second word graph model for searching, and obtain the second search result.
  • the second search result includes the second path and the corresponding second path score, wherein the second word graph model includes a second word graph space, and the first word graph space is a sub word graph space of the second word graph space.
  • the first search result may be the first search result in step 202 or the nbest result.
  • the acoustic model and dictionary are not configured in the second word graph model, and the first search result of the first word graph model is used as input, which can save the process of acoustic decoding.
  • the second word graph model can be a local word graph model; because the second word graph model is configured locally, the voice information can be recognized without going through the network, thereby improving the speed of speech recognition.
  • the second word graph model can be a word graph model based on wfst, and the second word graph space in the second word graph model can be a static word graph space.
  • the static word graph space is a word graph space that has already been trained and whose phoneme word weights remain unchanged; the first search result is searched through this static word graph network.
  • the second search result is the search result obtained in the second word graph model.
  • the second search result includes multiple second paths, and each path has a corresponding path score; the path score indicates the credibility of the path, and the higher the score, the more credible the path.
  • the path score is the product of the weights of the phoneme words in the path; these weights can be obtained by training the second word graph model until the loss function converges.
  • the second word graph space in the second word graph model can be organized by the user, that is, it can be smaller than a traditional word graph network, reducing the complexity of the word graph network and thereby increasing the speed of the decoding search and the real-time rate of decoding.
  • Step 204 Select a corresponding second path for output according to the second path score in the second search result, and obtain a voice recognition result.
  • the second path includes a complete sentence composed of phoneme words and a corresponding path score.
  • the path score indicates the credibility of the sentence; the higher the path score, the higher the credibility that the sentence reflects the true content of the voice information.
  • the complete sentence corresponding to the second path with the highest path score can be selected for output, thereby obtaining a speech recognition result.
  • multiple complete sentences corresponding to second paths with higher path scores can also be selected for output, thereby obtaining multiple voice recognition results for output, and the user can select from multiple voice recognition results.
  • in summary, the voice information to be recognized is acquired; the voice information to be recognized is input into the local first word graph model for decoding and search to obtain a first search result, which includes the first paths and the corresponding first path scores, the first word graph model including an acoustic model, a pronunciation dictionary, and a first word graph space; the first search result is input into the local second word graph model for searching to obtain a second search result, which includes the second paths and the corresponding second path scores, the second word graph model including a second word graph space of which the first word graph space is a sub word graph space; and the corresponding second path is selected for output according to the second path score, giving the speech recognition result.
  • because the second search process does not require acoustic decoding, the search dimensionality becomes lower, which effectively reduces the amount of word graph search, thereby reducing the search time and improving the speed of speech recognition.
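  • As an illustration only: the two-pass flow summarized above can be sketched in Python as below. The FirstWordGraphModel and SecondWordGraphModel interfaces are hypothetical stand-ins for the locally configured models (the first performs acoustic decoding plus search, the second only searches the paths it is given); this is a sketch of the idea, not the patented implementation.

```python
def recognize(speech, first_model, second_model):
    """Two-pass recognition: decode and search in a small word graph space,
    then re-search the surviving paths in the larger word graph space."""
    # Pass 1: acoustic decoding + search in the first (small) word graph space.
    first_result = first_model.decode_and_search(speech)   # [(path, score), ...]

    # Pass 2: search the first-pass paths in the second (larger) word graph
    # space; no acoustic decoding is performed, which lowers the search dimension.
    second_result = second_model.search(first_result)      # [(path, score), ...]

    # Select the second path with the highest second path score as the result.
    best_path, _best_score = max(second_result, key=lambda item: item[1])
    return best_path
```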
  • the above voice recognition method further includes:
  • Step 301 Acquire current context information of the user.
  • the current context information can be determined from the time: for example, if working hours are from 9:00 to 17:00, the context can be determined to be a work context; on weekends it can be determined to be a vacation context; and after 22:00 and before 8:00 it can be determined to be a rest context. It can also be determined from how the voice to be recognized was acquired: for example, if the voice to be recognized comes from a friend on WeChat, the context can be determined to be a friends' chat context, and if it comes from a user noted as a customer in WeChat or other social software, it can be determined to be a work context. In a possible implementation, the context may also be specified by the user, and context information selected by the user personally can be more accurate.
  • Step 302 Select the corresponding first word graph model according to the user's current context information to decode and search the voice information.
  • the first word graph model may be a first word graph model with context attributes, each first word graph model corresponding to one or more context attributes; the context information obtained in step 301 is matched to the corresponding first word graph model. Matching the context information to the corresponding first word graph model makes the results obtained by the first word graph model better suited to the context and improves accuracy.
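  • As an illustration only: one way to realize this context matching is a simple lookup from a context label to a locally configured first word graph model. The sketch below encodes the time and source rules described above; the label names, the "customer_chat"/"friend_chat" sources, and the dictionary-based lookup are assumptions made for illustration.

```python
from datetime import datetime

def infer_context(now: datetime, source: str) -> str:
    """Derive a coarse context label from the current time and the message source."""
    if source == "customer_chat":        # sender is noted as a customer -> work context
        return "work"
    if source == "friend_chat":          # voice from a friend on social software
        return "friends_chat"
    if now.weekday() >= 5:               # Saturday or Sunday -> vacation context
        return "vacation"
    if 9 <= now.hour < 17:               # working hours -> work context
        return "work"
    if now.hour >= 22 or now.hour < 8:   # late night / early morning -> rest context
        return "rest"
    return "default"

def select_first_model(models_by_context: dict, context: str):
    """Pick the first word graph model whose context attribute matches."""
    return models_by_context.get(context, models_by_context["default"])
```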
  • the above-mentioned first search result is a path result of at least one path
  • Step 401 Obtain the path result of the first path and the corresponding first path score through decoding search.
  • Step 402 According to the first path score from high to low, m path results from n path results are sequentially selected for output, to obtain a first search result, where m is less than or equal to n.
  • by decoding search, the scores of the search results (the first paths) under the first word graph model are obtained, that is, at least one first path is scored; n search results (first paths) correspond to n scores, and the n-best results sorted by these scores are obtained as the first search result.
  • the first search results may be sorted according to nbest scores, that is, the search results corresponding to the highest first path score are ranked first.
  • the input amount of the second word graph model can be reduced.
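  • As an illustration only: steps 401 and 402 amount to keeping the m best of the n decoded paths. A minimal sketch, assuming each path result is a (path, score) pair:

```python
def select_nbest(path_results, m):
    """Sort the n first-path results by score (high to low) and keep the top m,
    where m <= n; the returned n-best list is what is fed into the second word
    graph model, reducing the amount of input that model has to search."""
    ranked = sorted(path_results, key=lambda item: item[1], reverse=True)
    return ranked[:m]
```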
  • the construction of the above-mentioned first word graph model includes the following steps:
  • Step 501 Extract a word graph unit from the pre-built second word graph space, and construct a first word graph space according to the word graph unit.
  • Step 502 Construct the first word graph model according to the acoustic model, pronunciation dictionary, and first word graph space.
  • the second word graph space in the second word graph model may be configured through a local dictionary, or may be a word graph space pre-downloaded to the local.
  • the word graph unit may include a language unit and a corresponding weight; the language unit can be understood as a phoneme word in the first search result.
  • the word graph unit can also be understood as a word graph path.
  • word graph units with various context attributes can be extracted from the second word graph space to construct first word graph spaces for different contexts, so that the search and decoding range of the voice information in the first word graph model becomes smaller, thereby improving the speed at which the first word graph model decodes the speech information.
  • the above steps can be understood as pruning the second word graph space to obtain the first word graph space. It should be understood that the number of the aforementioned first word graph models may be one or more.
  • the first word graph space can be augmented to add word graph units with similar context attributes to expand the first word graph space into a second word graph space.
  • the weight of each language unit in the first word graph space obtained after pruning changes as the first word graph model is trained, so the weight of the same language unit differs between the first word graph space and the second word graph space; that is, searching the same path in the first word graph model and in the second word graph model yields different path scores.
  • the first word graph space is constructed by extracting word graph units with the same attributes from the second word graph space, which avoids a mismatch between the first search result and the second word graph model that would cause recognition errors.
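  • As an illustration only: a minimal sketch of this pruning/extraction step, assuming the second word graph space is represented as a list of word graph units, each carrying a set of context attributes and a weight (the application does not fix a data structure, so this representation is an assumption):

```python
def build_first_word_graph_space(second_space, context):
    """Extract the word graph units whose context attributes contain the given
    context, yielding a sub word graph space of the second word graph space."""
    return [unit for unit in second_space if context in unit["contexts"]]
```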
  • the construction of the above-mentioned first word graph model further includes the following steps:
  • Step 601 Train the first word graph model to fit the loss function to obtain the weight of the word graph unit in the first word graph space.
  • when the word graph unit is a language unit, the language units used to construct the first word graph model can be combined according to the word graph combination relationships in the second word graph model, and the first word graph model can be trained to adjust the weights of the language units; the new word graph space obtained in this way serves as the word graph space of the first word graph model.
  • the scoring result of the first word graph path can be adjusted by training the first word graph model.
  • when the first word graph space is constructed by extracting word graph units from the second word graph space, the first word graph model can additionally be trained to improve its recognition accuracy without being affected by the second word graph space.
  • step 203 specifically includes:
  • Step 701 Extract the word graph unit in the first search result.
  • Step 702 Input the word graph unit in the first search result into the second word graph model for searching.
  • when the word graph unit is a language unit, the language unit can be input into the second word graph model for search, obtaining the second word graph path of the corresponding word graph unit in the second word graph model and the corresponding path score.
  • when the word graph unit is a first word graph path, the first word graph path is decomposed in the second word graph model to obtain its language units, and the language units are then searched against the paths in the second word graph space to obtain the second word graph path and the corresponding path score.
  • when the word graph unit is a first word graph path, these first word graph paths are input into the second word graph model and matched against the second word graph paths in the second word graph space; since the same path may have different path scores in the first word graph space and the second word graph space, this amounts to a wide-area verification of the first search result in the second word graph space, ensuring the accuracy of the speech recognition result.
  • the first search result is searched in the second word graph space in the form of a word graph unit, without acoustically decoding the speech information to be recognized, the search dimension is reduced, and the speed of speech recognition is improved.
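  • As an illustration only: a minimal sketch of the second-pass search of steps 701 and 702, assuming the second word graph space is represented as a mapping from pairs of adjacent phoneme words to trained connection weights, with path scores computed as products of weights as described above; this representation and the fallback weight for unseen connections are illustrative assumptions.

```python
def second_pass_search(first_search_result, arc_weights):
    """Re-score each first-pass path in the (larger) second word graph space.

    first_search_result: list of (path, first_score) pairs, a path being a
                         sequence of phoneme words (language units).
    arc_weights:         dict mapping (previous_word, word) to the trained
                         weight of that connection in the second word graph space.
    """
    second_result = []
    for path, _first_score in first_search_result:
        score = 1.0
        for prev_word, word in zip(path, path[1:]):
            # Unseen connections get a very small weight instead of failing outright.
            score *= arc_weights.get((prev_word, word), 1e-9)
        second_result.append((path, score))
    return second_result
```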
  • step 204 specifically includes:
  • Step 801 Sort the second path according to the score of the second path.
  • Step 802 Output the speech recognition results corresponding to y second paths in order, where y is greater than or equal to 1.
  • the second path with a high score can be ranked first, and the second path with a low score can be ranked behind.
  • in this way, selecting the complete sentences corresponding to the second word graph paths for output is more intuitive: for example, if only one result is to be output, the complete sentence corresponding to the first-ranked second word graph path is extracted for output; if multiple results are to be output, the top-ranked ones are extracted for output so that the user can choose among them.
  • the second paths are sorted before output; outputting the complete sentences in sorted order makes the voice recognition result more convenient and intuitive.
  • the aforementioned non-volatile storage medium may be, for example, a magnetic disk, an optical disk, a read-only memory (ROM), and the like.
  • this application provides an embodiment of a speech recognition device.
  • this device embodiment corresponds to the method embodiment described above, and the device can be applied to various electronic devices.
  • a speech recognition device 900 of this embodiment includes a first acquisition module 901, a first search module 902, a second search module 903, and an output module 904, wherein:
  • the first acquiring module 901 is configured to acquire voice information to be recognized
  • the first search module 902 is configured to input the to-be-recognized voice information into the local first word graph model for decoding search to obtain a first search result, the first search result including the first path and the corresponding first path score,
  • the first word graph model includes acoustic model, pronunciation dictionary and first word graph space;
  • the second search module 903 is configured to input the first search result into the local second word graph model for searching, and obtain the second search result.
  • the second search result includes the second path and the corresponding second path score, where the second word graph model includes a second word graph space, and the first word graph space is a sub-word graph space of the second word graph space;
  • the output module 904 is configured to select a corresponding second path for output according to the second path score to obtain a voice recognition result.
  • the first word graph model is at least one first word graph model configured locally, and the speech recognition apparatus 900 further includes a second acquisition module 905 and a selection module 906, wherein:
  • the second obtaining module 905 is used to obtain the current context information of the user
  • the selection module 906 is configured to select the corresponding first word graph model to decode and search the voice information according to the user's current context information.
  • the first search result is a path result of at least one path
  • the first search module 902 includes a decoding search unit 9021 and a first output unit 9022, wherein:
  • the decoding search unit 9021 is configured to obtain the path result of the first path and the corresponding first path score through decoding search;
  • the first output unit 9022 is configured to sequentially select m path results among the n path results according to the first path score from high to low for output to obtain the first search result, where m is less than or equal to n.
  • the speech recognition device 900 further includes a first word graph model construction module 907, and the first word graph model construction module 907 includes a first extraction unit 9071 and a construction unit 9072, wherein:
  • the first extraction unit 9071 is configured to extract the word graph unit from the pre-built second word graph space, and construct the first word graph space according to the word graph unit;
  • the construction unit 9072 is configured to construct the first word graph model according to the acoustic model, pronunciation dictionary, and first word graph space.
  • the first word graph model construction module 907 further includes a training unit 9073, wherein:
  • the training unit 9073 is configured to train the first word graph model until the loss function converges, obtaining the weights of the word graph units in the first word graph space.
  • the second search module 903 includes a second extraction unit 9031 and an input unit 9032, wherein:
  • the second extraction unit 9031 is used to extract the word map unit in the first search result
  • the input unit 9032 is configured to input the word graph unit in the first search result into the second word graph model for search.
  • the output module 904 includes a sorting unit 9041 and a second output unit 9042, wherein:
  • the sorting unit 9041 is configured to sort the second path according to the score of the second path
  • the second output unit 9042 is configured to output the voice recognition results corresponding to the y second paths in order, where y is greater than or equal to 1.
  • the voice recognition device provided in the embodiment of the present application can implement the various implementation manners in the method embodiments of FIG. 2 to FIG. 8 and the corresponding beneficial effects. To avoid repetition, details are not described herein again.
  • FIG. 16 is a block diagram of the basic structure of the computer device in this embodiment.
  • the computer device 16 includes a non-volatile memory 161, a processor 162, and a network interface 163 that are communicatively connected to each other through a system bus. It should be pointed out that only the computer device 16 with components 161-163 is shown in the figure, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead. Among them, those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
  • Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded devices, and so on.
  • the computer equipment can be computing equipment such as desktop computers, notebooks, palmtop computers, and cloud servers.
  • the computer device can interact with the user through a keyboard, mouse, remote control, touch panel, or voice control device.
  • the non-volatile memory 161 includes at least one type of readable non-volatile storage medium.
  • the readable non-volatile storage medium includes flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, and the like.
  • the non-volatile memory 161 may be an internal storage unit of the computer device 16, such as a hard disk or memory of the computer device 16.
  • the non-volatile memory 161 may also be an external storage device of the computer device 16, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 16.
  • the non-volatile memory 161 may also include both the internal storage unit of the computer device 16 and its external storage device.
  • the non-volatile memory 161 is generally used to store an operating system and various application software installed in the computer device 16, such as executable code of a voice recognition method.
  • the non-volatile memory 161 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 162 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
  • the processor 162 is generally used to control the overall operation of the computer device 16.
  • the processor 162 is configured to run executable codes or process data stored in the non-volatile memory 161, for example, run executable codes for a voice recognition method.
  • the network interface 163 may include a wireless network interface or a wired network interface, and the network interface 163 is generally used to establish a communication connection between the computer device 16 and other electronic devices.
  • This application also provides another implementation manner, that is, a computer-readable non-volatile storage medium is provided.
  • the computer-readable non-volatile storage medium stores executable code for speech recognition, and the executable code can be executed by at least one processor, so that the at least one processor executes the steps of the speech recognition method described above.
  • the technical solution of this application, in essence or in the part that contributes over the existing technology, can be embodied in the form of a software product; the computer software product is stored in a non-volatile storage medium (such as a ROM, a magnetic disk, or an optical disc) and includes a number of instructions to enable a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the voice recognition method of the various embodiments of the present application.

Abstract

A speech identification method and apparatus, a computer device and a non-volatile storage medium, which relate to the technical field of artificial intelligence. The method comprises: acquiring speech information to be identified (201); inputting the speech information to be identified into a local first word graph model to perform decoding searching so as to obtain a first search result (202), the first search result comprising a first path and a corresponding first path score, and the first word graph model comprising an acoustic model, a pronunciation dictionary and a first word graph space; inputting the first search result into a local second word graph model for searching so as to obtain a second search result (203), the second search result comprising a second path and a corresponding second path score, the second word graph model comprising a second word graph space, and the first word graph space being a sub-word graph space of the second word graph space; and according to the second path score in the second search result, selecting the corresponding second path for output so as to obtain a speech identification result (204). By using the described method, the dimensions of searching are lowered and the amount of word graph searching is reduced, thus search time is shortened, and the speed of speech identification is increased.

Description

Speech recognition method, device, computer equipment and non-volatile storage medium
This application is based on, and claims priority from, the Chinese invention patent application No. 201910894996.8 filed on September 20, 2019, titled "A speech recognition method, device, computer equipment and storage medium".
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a speech recognition method, device, computer equipment and non-volatile storage medium.
Background
N-Gram is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese it is called the Chinese Language Model (CLM). The Chinese language model uses the collocation information between adjacent words in context to realize automatic conversion to Chinese characters.
When consecutive, unspaced pinyin, strokes, or digits representing letters or strokes need to be converted into a Chinese character string (i.e. a sentence), the Chinese language model can compute the sentence with the greatest probability, realizing automatic conversion to Chinese characters without manual selection by the user and avoiding the ambiguity of many Chinese characters corresponding to the same pinyin (or stroke string or digit string).
The model is based on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and on no other words; the probability of the entire sentence is then the product of the probabilities of its words, i.e. an N-gram context.
At present, most mainstream speech recognition decoders use a decoding network based on finite state transducers (WFST), which integrates the language model, the dictionary and the acoustic shared phonetic character set into one large decoding network. During path search, the acoustic decoding dimension must also be searched, so the search volume is large.
Summary of the Invention
The purpose of the embodiments of this application is to provide a speech recognition method that can reduce the search dimension of the decoding network, increase the search speed of the decoding network, and thereby increase the speed of speech recognition.
To solve the above technical problem, an embodiment of this application provides a speech recognition method that adopts the following technical solution:
The method includes the following steps:
acquiring the speech information to be recognized;
inputting the speech information to be recognized into a local first word graph model for decoding search to obtain a first search result, where the first search result includes first paths and corresponding first path scores, and the first word graph model includes an acoustic model, a pronunciation dictionary and a first word graph space;
inputting the first search result into a local second word graph model for searching to obtain a second search result, where the second search result includes second paths and corresponding second path scores, the second word graph model includes a second word graph space, and the first word graph space is a sub word graph space of the second word graph space;
selecting the corresponding second path for output according to the second path scores in the second search result, to obtain the speech recognition result.
To solve the above technical problem, an embodiment of this application further provides a speech recognition device, including:
an acquisition module, configured to acquire the speech information to be recognized;
a first search module, configured to input the speech information to be recognized into a local first word graph model for decoding search to obtain a first search result, where the first search result includes first paths and corresponding first path scores, and the first word graph model includes an acoustic model, a pronunciation dictionary and a first word graph space;
a second search module, configured to input the first search result into a local second word graph model for searching to obtain a second search result, where the second search result includes second paths and corresponding second path scores, the second word graph model includes a second word graph space, and the first word graph space is a sub word graph space of the second word graph space;
an output module, configured to select the corresponding second path for output according to the second path scores, to obtain the speech recognition result.
To solve the above technical problem, an embodiment of this application further provides a computer device that adopts the following technical solution:
The computer device includes a non-volatile memory and a processor; executable code is stored in the non-volatile memory, and when the processor executes the executable code, the processor implements the steps of the speech recognition method described in any one of the embodiments of this application.
To solve the above technical problem, an embodiment of this application further provides a computer-readable non-volatile storage medium that adopts the following technical solution:
Executable code is stored on the computer-readable non-volatile storage medium, and when the executable code is executed by a processor, the steps of the speech recognition method described in any one of the embodiments of this application are implemented.
Compared with the prior art, this application inputs the speech information to be recognized into a small word graph model for acoustic decoding and search, and then inputs the search result directly into a larger word graph model for search. The second search process does not need to perform acoustic decoding again, so the search dimensionality becomes lower, which effectively reduces the amount of word graph search, thereby reducing the search time and improving the speed of speech recognition.
Brief Description of the Drawings
In order to explain the solution of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
FIG. 1 is a schematic diagram of an exemplary system architecture to which this application can be applied;
FIG. 2 is a schematic flowchart of a speech recognition method of this application;
FIG. 3 is a schematic flowchart of another speech recognition method of this application;
FIG. 4 is a schematic flowchart of step 202 in the embodiment of FIG. 2 of this application;
FIG. 5 is a schematic flowchart of the construction of a first word graph model of this application;
FIG. 6 is a schematic flowchart of the construction of another first word graph model of this application;
FIG. 7 is a schematic flowchart of step 203 in the embodiment of FIG. 2 of this application;
FIG. 8 is a schematic flowchart of step 204 in the embodiment of FIG. 2 of this application;
FIG. 9 is a schematic structural diagram of a speech recognition device of this application;
FIG. 10 is a schematic structural diagram of another speech recognition device of this application;
FIG. 11 is a schematic diagram of a specific structure of the first search module 902;
FIG. 12 is a schematic structural diagram of another speech recognition device of this application;
FIG. 13 is a schematic diagram of the specific structure of the first word graph model construction module 907;
FIG. 14 is a schematic diagram of a specific structure of the second search module 903;
FIG. 15 is a schematic diagram of a specific structure of the output module 904;
FIG. 16 is a block diagram of the basic structure of a computer device of this application.
Detailed Description
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification are only for describing specific embodiments and are not intended to limit the application. The terms "including" and "having" in the specification and claims of this application and in the above description of the drawings, and any variations thereof, are intended to cover non-exclusive inclusion. The terms "first", "second", and the like in the specification, claims and drawings are used to distinguish different objects, not to describe a specific order.
Reference to an "embodiment" herein means that a specific feature, structure or characteristic described in conjunction with the embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
To enable those skilled in the art to better understand the solution of this application, the technical solutions in the embodiments of this application are described clearly and completely below in conjunction with the accompanying drawings.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 is the medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types such as wired links, wireless communication links or fiber optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and so on. Various communication client applications, such as web browsers, shopping applications, search applications, instant messaging tools, mailbox clients and social platform software, may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices with display screens that support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and so on.
The server 105 may be a server that provides various services, for example a background server that provides support for the pages displayed on the terminal devices 101, 102, 103.
It should be noted that the speech recognition method provided in the embodiments of this application is generally executed by a terminal device; correspondingly, the speech recognition device is generally provided in the terminal device.
It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are only illustrative; any number of terminal devices, networks and servers may be provided according to implementation needs.
Continuing to refer to FIG. 2, a flowchart of an embodiment of a speech recognition method according to this application is shown. The speech recognition method includes the following steps:
Step 201: acquire the speech information to be recognized.
In this embodiment, the electronic device on which the speech recognition method runs (for example, the terminal device shown in FIG. 1) can acquire the speech information to be recognized through a wired connection or a wireless connection. It should be pointed out that the wireless connection may include, but is not limited to, a 3G/4G connection, a WiFi (Wireless-Fidelity) connection, a Bluetooth connection, a WiMAX (Worldwide Interoperability for Microwave Access) connection, a Zigbee connection, a UWB (ultra wideband) connection, and other wireless connection methods that are currently known or developed in the future.
The speech information to be recognized can be collected through a microphone; the microphone may be an external device or built into the device, for example the microphone in a voice recorder, mobile phone, tablet, MP4 player or notebook. Alternatively, the speech information to be recognized may be uploaded by the user, for example by storing the collected speech in a storage device and reading the data in the storage device to obtain the corresponding speech information. The speech information to be recognized may also be the other party's speech information obtained when the user communicates through social software.
In a possible implementation, the speech information to be recognized may also be speech information that has undergone domain conversion, for example speech information that has been converted from the time domain into the frequency domain.
The speech information may also be referred to as a speech signal or speech data.
步骤202,将所述待识别语音信息输入本地的第一词图模型中进行解码搜索,得到第一搜索结果,第一搜索结果包括第一路径以及对应的第一路径分数,第一词图模型包括声学模型、发音词典及第一词图空间。Step 202: Input the to-be-recognized speech information into a local first word graph model for decoding search, and obtain a first search result. The first search result includes the first path and the corresponding first path score. The first word graph model Including acoustic model, pronunciation dictionary and first word image space.
其中,上述的本地可以是Linux系统下的离线环境,该离线环境中还可以配置其他场景的离线语音工具,上述的待识别语音信息为步骤201中的获取的待识别语音信息,上述的第一词图模型为本地的词图模型,将第一词图模型配置在本地,可以不通过网络就可以对语音信息进行解码,从而提高了语音识别的速度。第一词图模型可以是基于wfst的词图模型, 第一词图模型中包括声学模型、发音词典及第一词图空间,上述的声学模型可以对用户语音信息进行声学解码,使语音信息解码形成音素单元,上述的发音词典用于将音素单元进行组合,形成音素词,上述的第一词图空间中,各音素词连接成路径,形成语言单元。通过第一词图模型对待识别语音信息进行解码搜索,第一搜索结果为第一词图空间中得到的搜索结果,第一搜索结果包括多个第一路径,每条路径包括对应的路径分数,路径分数用于表示该条路径的可信程度,分数越高,则表示该条路径越可信。Wherein, the aforementioned local can be an offline environment under the Linux system, and offline speech tools in other scenarios can also be configured in the offline environment. The aforementioned speech information to be recognized is the speech information to be recognized obtained in step 201, and the aforementioned first The word graph model is a local word graph model. If the first word graph model is configured locally, the speech information can be decoded without going through the network, thereby improving the speed of speech recognition. The first word graph model can be a word graph model based on wfst. The first word graph model includes an acoustic model, a pronunciation dictionary, and a first word graph space. The above-mentioned acoustic model can acoustically decode the user's speech information, so that the speech information can be decoded. A phoneme unit is formed. The above pronunciation dictionary is used to combine phoneme units to form phoneme words. In the first word graph space mentioned above, each phoneme word is connected to form a path to form a language unit. The first word graph model is used to decode and search the speech information to be recognized. The first search result is the search result obtained in the first word graph space. The first search result includes multiple first paths, and each path includes a corresponding path score. The path score is used to indicate the credibility of the path, the higher the score, the more credible the path.
其中,路径为各音素词的连接及连接权重,比如:Among them, the path is the connection and connection weight of each phoneme word, such as:
今天(权重0.9)天气(权重0.8)怎么样(权重0.9)，该路径评分为所有权重的积，0.9*0.8*0.9=0.648。今天 "today" (weight 0.9), 天气 "weather" (weight 0.8), 怎么样 "how about it" (weight 0.9): the path score is the product of all the weights, 0.9*0.8*0.9 = 0.648.
近天(权重0.3)天气(权重0.2)怎么样(权重0.8)，该路径评分为所有权重的积，0.3*0.2*0.8=0.048。近天 "recent days" (weight 0.3), 天气 "weather" (weight 0.2), 怎么样 "how about it" (weight 0.8): the path score is the product of all the weights, 0.3*0.2*0.8 = 0.048.
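The path-score arithmetic above is straightforward to write out. The following minimal sketch recomputes the two example scores; the helper function and the weight values are purely illustrative (the weights from the example above, not trained values):

    # Minimal sketch: a path's score is the product of the connection weights
    # along its phoneme-word sequence. The weights are the illustrative values
    # from the example above, not trained values.
    def path_score(weights):
        score = 1.0
        for w in weights:
            score *= w
        return score

    print(path_score([0.9, 0.8, 0.9]))  # 今天 天气 怎么样  -> ≈ 0.648
    print(path_score([0.3, 0.2, 0.8]))  # 近天 天气 怎么样  -> ≈ 0.048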
上述的权重由对第一词图模型进行训练得到,训练语料可以是网上公开的训练语料,比如《人民日报》2000年到2012年的全部训练语料。The above-mentioned weights are obtained by training the first word graph model, and the training corpus can be the training corpus publicly available on the Internet, such as all the training corpus of the People's Daily from 2000 to 2012.
步骤203，将第一搜索结果输入本地的第二词图模型中进行搜索，得到第二搜索结果，第二搜索结果包括第二路径以及对应第二路径分数，其中，第二词图模型包括第二词图空间，第一词图空间为第二词图空间的子词图空间。Step 203: Input the first search result into the local second word graph model for searching to obtain a second search result. The second search result includes second paths and corresponding second path scores, where the second word graph model includes a second word graph space and the first word graph space is a sub-word-graph space of the second word graph space.
在本实施例中,第一搜索结果可以是步骤202中的第一搜索结果,也可以是nbest结果。需要说明的是,第二词图模型中不配置声学模型与词典,使用第一词图模型的第一搜索结果做为输入,可以省去声学解码的过程,第二词图模型可以为本地的词图模型,将第二词图模型配置在本地,可以不用通过网络就可以对语音信息进行识别,从而提高了语音识别的速度。第二词图模型可以是基于wfst的词图模型,第二词图模型中的第二词图空间可以是静态词图空间,上述的静态词图空间表示已经训练好的,音素词权重不变的词图空间,通过静态词图网络对第一搜索结果进行搜索,第二搜索结果为第二词图模型中得到的搜索结果,第二搜索结果包括多个第二路径,每条路径包括对应的路径分数,路径分数用于表示该条路径的可信程度,分数越高,则表示该条路径越可信。路径分数为该路径中音素词权重的积,上述音素词权重可以通过对第二词图模型进行训练,直到损失函数拟合,即可得到音素词的权重。In this embodiment, the first search result may be the first search result in step 202 or the nbest result. It should be noted that the acoustic model and dictionary are not configured in the second word graph model, and the first search result of the first word graph model is used as input, which can save the process of acoustic decoding. The second word graph model can be local The word graph model, the second word graph model is configured locally, and the voice information can be recognized without going through the network, thereby improving the speed of speech recognition. The second word graph model can be a word graph model based on wfst, and the second word graph space in the second word graph model can be a static word graph space. The above static word graph space means that it has been trained and the phoneme word weight remains unchanged In the word graph space, the first search result is searched through the static word graph network. The second search result is the search result obtained in the second word graph model. The second search result includes multiple second paths, and each path includes the corresponding The path score is used to indicate the credibility of the path. The higher the score, the more credible the path. The path score is the product of the weights of phonemes in the path. The weights of the phonemes can be obtained by training the second word graph model until the loss function is fitted.
可选的，第二词图模型中的第二词图空间可以由用户进行整理得到，即第二词图模型中的第二词图空间可以小于传统的词图网络，降低词图网络的复杂度，从而提高解码搜索的速度，提高解码的实时率。Optionally, the second word graph space in the second word graph model can be obtained through curation by the user; that is, the second word graph space in the second word graph model can be smaller than a traditional word graph network, reducing the complexity of the word graph network, thereby increasing the speed of the decoding search and improving the real-time rate of decoding.
步骤204,根据第二搜索结果中的第二路径分数选择对应的第二路径进行输出,得到语音识别结果。Step 204: Select a corresponding second path for output according to the second path score in the second search result, and obtain a voice recognition result.
在本实施例中，第二路径包括音素词组成的完整语句以及对应的路径分数，路径分数用来表示该语句的可信度，路径分数越高，则语句为语音信息的真实内容的可信度越高。可以选取路径分数最高的第二路径所对应的完整语句进行输出，从而得到一个语音识别结果。另外，也可以选取多个路径分数较高的第二路径所对应的完整语句进行输出，从而得到多个语音识别结果进行输出，用户可以从多个语音识别结果中进行选取。In this embodiment, a second path includes a complete sentence composed of phoneme words and a corresponding path score. The path score indicates the credibility of the sentence: the higher the path score, the more credible it is that the sentence is the true content of the speech information. The complete sentence corresponding to the second path with the highest path score can be selected for output, yielding one speech recognition result. Alternatively, the complete sentences corresponding to several second paths with higher path scores can be output, yielding multiple speech recognition results from which the user can choose.
在本实施例中,获取待识别语音信息;将待识别语音信息输入本地的第一词图模型中进行解码搜索,得到第一搜索结果,第一搜索结果包括第一路径以及对应的第一路径分数,第一词图模型包括声学模型、发音词典及第一词图空间;将第一搜索结果输入本地的第二词图模型中进行搜索,得到第二搜索结果,第二搜索结果包括第二路径以及对应第二路径分数,其中,第二词图模型包括第二词图空间,第一词图空间为第二词图空间的子词图空间;根据第二路径分数选择对应的第二路径进行输出,得到语音识别结果。通过将待识别语音信息输入一个小的词图模型中进行声学解码以及搜索,再将搜索结果直接输入到较大的词图模型中进行搜索,二次搜索过程无需再进行声学解码,可以使搜索的维度变低,有效降低词图搜索的量,从而降低搜索的时间,提高语音识别的速度。In this embodiment, the voice information to be recognized is acquired; the voice information to be recognized is input into the local first word graph model for decoding and search, and the first search result is obtained. The first search result includes the first path and the corresponding first path Score, the first word graph model includes acoustic model, pronunciation dictionary and first word graph space; input the first search result into the local second word graph model to search, get the second search result, the second search result includes the second Paths and corresponding second path scores, where the second word graph model includes a second word graph space, and the first word graph space is a sub-word graph space of the second word graph space; the corresponding second path is selected according to the second path score Output, get the result of speech recognition. By inputting the speech information to be recognized into a small word graph model for acoustic decoding and searching, and then directly inputting the search results into a larger word graph model for searching, the second search process does not require acoustic decoding, which can make the search The dimensionality becomes lower, which effectively reduces the amount of word map search, thereby reducing the search time and improving the speed of speech recognition.
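To keep the two-pass structure of steps 201 to 204 in view, a schematic sketch is given below. The first_model and second_model objects and their decode and rescore methods are placeholders assumed for illustration only; they do not name the API of any actual WFST toolkit.

    # Schematic sketch of the two-pass flow of steps 201-204. The model objects
    # and their decode()/rescore() methods are assumed placeholders for a
    # WFST-style decoder, not a real toolkit API.
    def recognize(audio, first_model, second_model, m=100, y=1):
        # Pass 1: acoustic decoding + search in the small local first word graph
        # model; each hypothesis is a (word_sequence, path_score) pair.
        first_results = first_model.decode(audio)
        nbest = sorted(first_results, key=lambda h: h[1], reverse=True)[:m]

        # Pass 2: search the m kept hypotheses in the larger second word graph
        # space; no acoustic decoding happens here, only path scoring.
        second_results = [(path, second_model.rescore(path)) for path, _ in nbest]
        second_results.sort(key=lambda h: h[1], reverse=True)

        # Step 204: output the y highest-scoring second paths as the result.
        return [path for path, _ in second_results[:y]]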
进一步的,如图3所示,在步骤202之前,上述语音识别方法还包括:Further, as shown in FIG. 3, before step 202, the above voice recognition method further includes:
步骤301,获取用户当前的语境信息。Step 301: Acquire current context information of the user.
上述当前的语境信息可以根据时间进行确定，比如在9点到17点为工作时间，则可以确定语境为工作语境，在周末，则可以确定为休假语境，在22点以后8点之前，则可以确定为休息语境。也可以根据待识别语音的获取来进行确定，比如待识别语音由微信好友处获取，则可以确定为朋友聊天语境，待识别语音由微信或其他社交软件中备注为客户的用户处获取，则可以确定为工作语境。在一种可能的实施方式中，用户的语境也可以由用户自行进行确定，通过用户自行选取语境，得到的语境信息更精确。The above current context information can be determined according to time. For example, during working hours from 9:00 to 17:00 the context can be determined to be a work context, on weekends it can be determined to be a vacation context, and after 22:00 and before 8:00 it can be determined to be a rest context. It can also be determined according to how the speech to be recognized was obtained. For example, if the speech to be recognized is obtained from a WeChat friend, the context can be determined to be a friend-chat context; if it is obtained from a user noted as a customer in WeChat or other social software, it can be determined to be a work context. In a possible implementation, the context may also be determined by the user; context information selected by the user directly is more accurate.
步骤302,根据用户当前的语境信息选择对应的第一词图模型对语音信息进行解码搜索。Step 302: Select the corresponding first word graph model according to the user's current context information to decode and search the voice information.
在本实施例中，上述的第一词图模型可以是具有语境属性的第一词图模型，每个第一词图模型对应一个或多个语境属性，可以通过步骤301中获取到的语境信息匹配对应的第一词图模型。通过语境信息匹配到对应的第一词图模型，可以使第一词图模型所得到的结果更贴合语境，提高精准度。In this embodiment, the above first word graph model may be a first word graph model with context attributes. Each first word graph model corresponds to one or more context attributes, and the corresponding first word graph model can be matched using the context information obtained in step 301. Matching the context information to the corresponding first word graph model makes the results produced by the first word graph model fit the context better and improves accuracy.
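As a concrete illustration of how the context lookup of steps 301 and 302 might be organized, the sketch below maps a time of day or a message source to a context label and then to a model name. The context labels, time windows, and registry contents are assumptions made for illustration and are not values given in this application.

    from datetime import datetime

    # Illustrative sketch of steps 301-302: infer the current context and pick
    # the matching first word graph model. All labels and thresholds here are
    # assumptions for illustration only.
    CONTEXT_TO_MODEL = {
        "work": "first_wordgraph_work",
        "chat": "first_wordgraph_chat",
        "vacation": "first_wordgraph_vacation",
        "rest": "first_wordgraph_rest",
    }

    def infer_context(now=None, source=None):
        now = now or datetime.now()
        if source == "customer_contact":      # speech from a contact noted as a customer
            return "work"
        if source == "friend_contact":        # speech from a friend on social software
            return "chat"
        if now.weekday() >= 5:                # weekend
            return "vacation"
        if 9 <= now.hour < 17:                # working hours
            return "work"
        if now.hour >= 22 or now.hour < 8:    # late night / early morning
            return "rest"
        return "chat"

    def select_first_model(now=None, source=None):
        return CONTEXT_TO_MODEL[infer_context(now, source)]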
进一步的，如图4所示，上述第一搜索结果为至少一个路径的路径结果，将所述待识别语音信息输入本地的第一词图模型中进行解码搜索，得到第一搜索结果的步骤具体包括：Further, as shown in FIG. 4, the above first search result is a path result of at least one path, and the step of inputting the to-be-recognized speech information into the local first word graph model for decoding search to obtain the first search result specifically includes:
步骤401,通过解码搜索获取第一路径的路径结果以及对应的第一路径分数。Step 401: Obtain the path result of the first path and the corresponding first path score through decoding search.
步骤402,根据第一路径分数由高到低依次选取n个路径结果中的m个路径结果进行输出,得到第一搜索结果,其中,m小于等于n。Step 402: According to the first path score from high to low, m path results from n path results are sequentially selected for output, to obtain a first search result, where m is less than or equal to n.
本实施例中，通过对语音信息在第一词图模型中进行解码搜索，可以得到第一词图模型下的搜索结果（第一路径）的评分，即是对至少一个第一路径评分，具体的，n个搜索结果（第一路径）对应有n个评分，得到根据评分排序的nbest结果做为第一搜索结果。In this embodiment, by decoding and searching the speech information in the first word graph model, the scores of the search results (first paths) under the first word graph model can be obtained, that is, at least one first path is scored. Specifically, n search results (first paths) correspond to n scores, and the nbest results sorted by score are obtained as the first search result.
例如:对待识别语音信息为“今天天气怎么样”在第一词图模型中的进行搜索,这样经过第一词图模型解码后会给出200个nbest的解码结果:For example: search the first word graph model for the speech information to be recognized as "what's the weather today", so that after the first word graph model is decoded, it will give 200 nbest decoding results:
今天 天气 怎么样How's the weather today
近天 天气 怎么样How is the weather in recent days
今天 填起 怎么样How about filling in today
假设一共200个nbest结果;Assuming a total of 200 nbest results;
通过第一词图模型得到200个nbest(200best)结果,则可以选取100个或者全部200个nbest结果做为第一搜索结果。此时,n为200,m为100。If 200 nbest (200best) results are obtained through the first word graph model, 100 or all 200 nbest results can be selected as the first search result. At this time, n is 200 and m is 100.
在一种可能的实现方式,可以将第一搜索结果按nbest打分进行排序,即将第一路径分数最高对应的搜索结果排在前面。In a possible implementation manner, the first search results may be sorted according to nbest scores, that is, the search results corresponding to the highest first path score are ranked first.
本实施例中,通过取nbest结果中的m个第一搜索结果,作为第二词图模型的输入,可以减少第二词图模型的输入量。In this embodiment, by taking the m first search results in the nbest results as the input of the second word graph model, the input amount of the second word graph model can be reduced.
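A small sketch of step 402 is given below: it keeps only the m best-scoring of the n first-pass hypotheses, using heapq.nlargest to avoid a full sort. The hypothesis sentences and scores are illustrative values only.

    import heapq

    # Sketch of step 402: keep the m highest-scoring of the n first-pass
    # hypotheses; hypotheses are (sentence, score) pairs.
    def top_m(nbest, m):
        return heapq.nlargest(m, nbest, key=lambda h: h[1])

    nbest = [
        ("今天 天气 怎么样", 0.648),
        ("近天 天气 怎么样", 0.048),
        ("今天 填起 怎么样", 0.031),
    ]
    print(top_m(nbest, 2))  # the two most credible first paths, best first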
进一步的,如图5所示,上述第一词图模型的构建包括以下步骤:Further, as shown in FIG. 5, the construction of the above-mentioned first word graph model includes the following steps:
步骤501,从预先构建好的第二词图空间中提取出词图单元,并根据所述词图单元构建第一词图空间。Step 501: Extract a word graph unit from the pre-built second word graph space, and construct a first word graph space according to the word graph unit.
步骤502,根据声学模型、发音词典、第一词图空间对所述第一词图模型进行构建。Step 502: Construct the first word graph model according to the acoustic model, pronunciation dictionary, and first word graph space.
其中,第二词图模型中的第二词图空间可以是通过本地词典进行配置,也可以是预先下载到本地的词图空间。词图单元可以包括语言单元及对应的权重,语言单元可以理解为第一搜索结果中的音素词。在一种可能的实现方式中,词图单元还可以理解为词图路径。具体的,可以根据第二词图中的语境属性,在第二词图空间中提取出具有各语境属性词图单元构建不同语境的第一词图空间,可以使语音信息在第一词图模型中进行搜索解码的范围变小,从而提高第一词图模型对语音信息解码的速度。以上步骤可以理解为对第二词图空间进行剪枝得到第一词图空间。需要理解的是,上述第一词图模型的数量可以是一个或多个。Wherein, the second word graph space in the second word graph model may be configured through a local dictionary, or may be a word graph space pre-downloaded to the local. The word map unit may include a language unit and a corresponding weight, and the language unit may be understood as a phoneme word in the first search result. In a possible implementation, the word graph unit can also be understood as a word graph path. Specifically, according to the context attributes in the second word graph, word graph units with various context attributes can be extracted from the second word graph space to construct the first word graph space of different contexts, so that the voice information can be in the first word graph space. The search and decoding range in the word graph model becomes smaller, thereby improving the speed of the first word graph model to decode speech information. The above steps can be understood as pruning the second word graph space to obtain the first word graph space. It should be understood that the number of the aforementioned first word graph models may be one or more.
另外,在另一种可能的实现方式中,可以对第一词图空间进行增枝,增加相近语境属性的词图单元,使第一词图空间扩展成为第二词图空间。In addition, in another possible implementation manner, the first word graph space can be augmented to add word graph units with similar context attributes to expand the first word graph space into a second word graph space.
另外，需要说明的是，剪枝后得到的第一词图空间中的各语言单元的权重会随着模型训练而发生变化，相同的语言单元的权重，在第一词图空间与第二词图空间是不相同的。同样的，增枝后得到第二词图空间中的各语言单元的权重与第一词图空间中相同的语言单元的权重也是不相同的。即是，相同的路径在第一词图模型与第二词图模型进行搜索，得到的路径评分不同。比如：In addition, it should be noted that the weights of the language units in the first word graph space obtained after pruning change as the model is trained, so the weight of the same language unit differs between the first word graph space and the second word graph space. Likewise, the weights of the language units in the second word graph space obtained after augmentation differ from the weights of the same language units in the first word graph space. That is, searching the same path in the first word graph model and in the second word graph model yields different path scores. For example:
第一词图模型中，今天(权重0.9) 天气(权重0.8) 怎么样(权重0.9)，该路径评分为所有权重的积，0.9*0.8*0.9=0.648。In the first word graph model: 今天 (weight 0.9) 天气 (weight 0.8) 怎么样 (weight 0.9); the path score is the product of all the weights, 0.9*0.8*0.9 = 0.648.
第二词图模型中，今天(权重0.99) 天气(权重0.98) 怎么样(权重0.99)，该路径评分为所有权重的积，0.99*0.98*0.99=0.960498。In the second word graph model: 今天 (weight 0.99) 天气 (weight 0.98) 怎么样 (weight 0.99); the path score is the product of all the weights, 0.99*0.98*0.99 = 0.960498.
在本实施例中，通过从第二词图空间中提取出具有相同属性的词图单元来对第一词图空间进行构建，可以避免第一搜索结果与第二词图模型不匹配，造成误识别。In this embodiment, the first word graph space is constructed by extracting word graph units with the same attributes from the second word graph space, which avoids a mismatch between the first search result and the second word graph model that would cause misrecognition.
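The pruning described in step 501 can be pictured as filtering the units of the larger word graph space by context attribute. The sketch below assumes a simplified edge list in which each connection carries a set of context tags; a real WFST lattice is stored differently, so this layout is an assumption made only for illustration.

    # Simplified sketch of step 501: build the first word graph space by keeping
    # only those units of the second word graph space that carry the wanted
    # context attribute ("pruning"). The edge-list layout is assumed for
    # illustration; a real WFST lattice is stored differently.
    def prune_by_context(second_space, context):
        # second_space: iterable of (src_word, dst_word, weight, context_tags)
        return [
            (src, dst, weight, tags)
            for (src, dst, weight, tags) in second_space
            if context in tags
        ]

    second_space = [
        ("今天", "天气", 0.8, {"chat", "work"}),
        ("天气", "怎么样", 0.9, {"chat", "work"}),
        ("报表", "汇总", 0.7, {"work"}),
    ]
    first_space = prune_by_context(second_space, "chat")  # keeps chat-context units only
    print(first_space)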
进一步的,如图6所示,上述第一词图模型的构建还包括以下步骤:Further, as shown in FIG. 6, the construction of the above-mentioned first word graph model further includes the following steps:
步骤601,对第一词图模型进行训练,训练至损失函数拟合,得到第一词图空间中词图单元的权重。Step 601: Train the first word graph model to fit the loss function to obtain the weight of the word graph unit in the first word graph space.
其中，当词图单元为语言单元时，可以将构建第一词图模型的语言单元按第二词图模型中的词图组合关系进行组合，可以通过训练第一词图模型来调整语言单元间的权重，得到新的词图空间做为第一词图模型的词图空间。当词图单元为第二路径时，可以通过训练第一词图模型来调整第一词图路径的评分结果。Here, when the word graph units are language units, the language units used to construct the first word graph model can be combined according to the word graph combination relations in the second word graph model, and the weights between language units can be adjusted by training the first word graph model; the new word graph space thus obtained serves as the word graph space of the first word graph model. When the word graph unit is the second path, the scoring results of the first word graph paths can be adjusted by training the first word graph model.
在本实施例中,在通过提取第二词图空间中的词图单元构建第一词图空间时,可以通过对第一词图模型进行训练,从而使第一词图模型的识别精确度提高,另外,不会受到第二词图空间影响。In this embodiment, when constructing the first word graph space by extracting word graph units in the second word graph space, the first word graph model can be trained to improve the recognition accuracy of the first word graph model In addition, it will not be affected by the second word graph space.
进一步的,如图7所示,步骤203具体包括:Further, as shown in FIG. 7, step 203 specifically includes:
步骤701,提取第一搜索结果中的词图单元。Step 701: Extract the word graph unit in the first search result.
步骤702,将第一搜索结果中的词图单元输入到第二词图模型中进行搜索。Step 702: Input the word graph unit in the first search result into the second word graph model for searching.
在本实施例中,当词图单元为语言单元时,可以将语言单元输入到第二词图模型中进行搜索,得到第二词图模型中对应词图单元的第二词图路径及对应的路径评分。当词图单元为第一词图路径时,在第二词图模型中对第一词图路径进行分解,得到语言单元,再将语言单元在第二词图空间中进行路径搜索,得到第二词图路径及对应的路径评分。另外,在词图单元为第一词图路径时,将这些第一词图路径输入到第二词图模型中,与第二词图模型的第二词图空间中的第二词图路径进行匹配,由于第一词图空间与第二词图空间中相同路径可能拥有不同的路径评分,相当于在第二词图空间中对第一搜索结果进行广域验证,保证了语音识别结果的精度。In this embodiment, when the word map unit is a language unit, the language unit can be input into the second word map model for search, and the second word map path of the corresponding word map unit in the second word map model and the corresponding Path score. When the word graph unit is the first word graph path, the first word graph path is decomposed in the second word graph model to obtain the language unit, and then the language unit is searched for the path in the second word graph space to obtain the second word graph space. The word map path and the corresponding path score. In addition, when the word graph unit is the first word graph path, these first word graph paths are input into the second word graph model, and the second word graph path in the second word graph space of the second word graph model is performed. Matching, since the same path in the first word map space and the second word map space may have different path scores, it is equivalent to wide-area verification of the first search result in the second word map space, ensuring the accuracy of the speech recognition results .
在本实施例中,以词图单元的形式将第一搜索结果在第二词图空间中进行搜索,不用再对待识别语音信息进行声学解码,降低了搜索维度,从而提高了语音识别的速度。In this embodiment, the first search result is searched in the second word graph space in the form of a word graph unit, without acoustically decoding the speech information to be recognized, the search dimension is reduced, and the speed of speech recognition is improved.
更进一步的,如图8所示,上述步骤204具体包括:Furthermore, as shown in FIG. 8, the above step 204 specifically includes:
步骤801,根据所述第二路径分数的高低对第二路径进行排序。Step 801: Sort the second path according to the score of the second path.
步骤802,按排序输出y个第二路径对应的语音识别结果,其中,y大于等于1。Step 802: Output the speech recognition results corresponding to y second paths in order, where y is greater than or equal to 1.
其中,可以将评分高的第二路径排在前面,将评分低的排在后面。这样,在选择输出的第二词图路径对应的完整语句会比较直观,比如只选择一个进行输出的情况下,可以将排在最前面的一个第二词图路径对应的完整语句提取出来进行输出,在选择多个进行输出的情况下,可以将排在前面的几个提取出来进行输出,以供用户对输出结果进行选取。Among them, the second path with a high score can be ranked first, and the second path with a low score can be ranked behind. In this way, the complete sentence corresponding to the second word map path selected for output will be more intuitive. For example, if only one is selected for output, the complete sentence corresponding to the first second word map path can be extracted for output. , In the case of selecting multiple output, the top ones can be extracted for output, so that the user can select the output result.
在本实施例中,对第二路径进行排序后再输出,根据排序输出的完整语句,可以使输出的语音识别结果更方便直观。In this embodiment, the second path is sorted and then output. According to the complete sentences output by sorting, the output voice recognition result can be more convenient and intuitive.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程，是可以通过可执行代码来指令相关的硬件来完成，该可执行代码可存储于一计算机可读取非易失性存储介质中，该可执行代码在执行时，可包括如上述各方法的实施例的流程。其中，前述的非易失性存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)等非易失性存储介质等。A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through executable code. The executable code can be stored in a computer-readable non-volatile storage medium, and when executed it may include the processes of the embodiments of the above methods. The aforementioned non-volatile storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), or another non-volatile storage medium.
应该理解的是,虽然附图的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,其可以以其他的顺序执行。而且,附图的流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,其执行顺序也不必然是依次进行,而是可以与其他步骤或者其他步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。It should be understood that although the various steps in the flowchart of the drawings are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in sequence in the order indicated by the arrows. Unless explicitly stated in this article, the execution of these steps is not strictly limited in order, and they can be executed in other orders. Moreover, at least part of the steps in the flowchart of the drawings may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but can be executed at different times, and the order of execution is also It is not necessarily performed sequentially, but may be performed alternately or alternately with at least a part of other steps or sub-steps or stages of other steps.
进一步参考图9，作为对上述图2所示方法的实现，本申请提供了一种语音识别装置的一个实施例，该装置实施例与图2所示的方法实施例相对应，该装置具体可以应用于各种电子设备中。With further reference to FIG. 9, as an implementation of the method shown in FIG. 2, this application provides an embodiment of a speech recognition apparatus. The apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can specifically be applied to various electronic devices.
如图9所示,本实施例的一种语音识别装置900包括:第一获取模块901、第一搜索模块902、第二搜索模块903、输出模块904。其中:As shown in FIG. 9, a speech recognition device 900 of this embodiment includes: a first acquisition module 901, a first search module 902, a second search module 903, and an output module 904. among them:
第一获取模块901,用于获取待识别语音信息;The first acquiring module 901 is configured to acquire voice information to be recognized;
第一搜索模块902,用于将所述待识别语音信息输入本地的第一词图模型中进行解码搜索,得到第一搜索结果,第一搜索结果包括第一路径以及对应的第一路径分数,第一词图模型包括声学模型、发音词典及第一词图空间;The first search module 902 is configured to input the to-be-recognized voice information into the local first word graph model for decoding search to obtain a first search result, the first search result including the first path and the corresponding first path score, The first word graph model includes acoustic model, pronunciation dictionary and first word graph space;
第二搜索模块903,用于将第一搜索结果输入本地的第二词图模型中进行搜索,得到第二搜索结果,第二搜索结果包括第二路径以及对应第二路径分数,其中,第二词图模型包括第二词图空间,第一词图空间为第二词图空间的子词图空间;The second search module 903 is configured to input the first search result into the local second word graph model for searching, and obtain the second search result. The second search result includes the second path and the corresponding second path score, where the second The word graph model includes a second word graph space, and the first word graph space is a sub-word graph space of the second word graph space;
输出模块904,用于根据第二路径分数选择对应的第二路径进行输出,得到语音识别结果。The output module 904 is configured to select a corresponding second path for output according to the second path score to obtain a voice recognition result.
进一步的,参阅图10,所述第一词图模型为配置在本地的至少一个第一词图模型,所述语音识别装置900还包括:第二获取模块905和选择模块906。其中,Further, referring to FIG. 10, the first word graph model is at least one first word graph model configured locally, and the speech recognition apparatus 900 further includes: a second acquisition module 905 and a selection module 906. among them,
第二获取模块905,用于获取用户当前的语境信息;The second obtaining module 905 is used to obtain the current context information of the user;
选择模块906,用于根据用户当前的语境信息选择对应的第一词图模型对语音信息进行解码搜索。The selection module 906 is configured to select the corresponding first word graph model to decode and search the voice information according to the user's current context information.
进一步的,参阅图11,所述第一搜索结果为至少一个路径的路径结果,所述第一搜索模块902包括:解码搜索单元9021、输出单元9022。其中,Further, referring to FIG. 11, the first search result is a path result of at least one path, and the first search module 902 includes: a decoding search unit 9021, an output unit 9022. among them,
解码搜索单元9021,用于通过解码搜索获取第一路径的路径结果以及对应的第一路径分数;The decoding search unit 9021 is configured to obtain the path result of the first path and the corresponding first path score through decoding search;
第一输出单元9022,用于根据所述第一路径分数由高到低依次选取n个路径结果中的 m个路径结果进行输出,得到第一搜索结果,其中,m小于等于n。The first output unit 9022 is configured to sequentially select m path results among the n path results according to the first path score from high to low for output to obtain the first search result, where m is less than or equal to n.
进一步的,参阅图12,语音识别装置900还包括第一词图模型构建模块907,所述第一词图模型构建模块907包括;第一提取单元9071、构建单元9072。其中:Further, referring to FIG. 12, the speech recognition device 900 further includes a first word graph model construction module 907, and the first word graph model construction module 907 includes; a first extraction unit 9071, a construction unit 9072. among them:
第一提取单元9071,用于从预先构建好的第二词图空间中提取出词图单元,并根据所述词图单元构建第一词图空间;The first extraction unit 9071 is configured to extract the word graph unit from the pre-built second word graph space, and construct the first word graph space according to the word graph unit;
构建单元9072,用于根据声学模型、发音词典、第一词图空间对所述第一词图模型进行构建。The construction unit 9072 is configured to construct the first word graph model according to the acoustic model, pronunciation dictionary, and first word graph space.
进一步的,参阅图13,所述第一词图模型构建模块907还包括训练单元9073。其中:Further, referring to FIG. 13, the first word graph model construction module 907 further includes a training unit 9073. among them:
训练单元9073,对所述第一词图模型进行训练,训练至损失函数拟合,得到所述第一词图空间中词图单元的权重。The training unit 9073 trains the first word graph model, trains to fit the loss function, and obtains the weight of the word graph unit in the first word graph space.
进一步的,参阅图14,所述第二搜索模块903包括:第二提取单元9031、输入单元9032。其中:Further, referring to FIG. 14, the second search module 903 includes: a second extraction unit 9031, an input unit 9032. among them:
第二提取单元9031,用于提取第一搜索结果中的词图单元;The second extraction unit 9031 is used to extract the word map unit in the first search result;
输入单元9032,用于将所述第一搜索结果中的词图单元输入到第二词图模型中进行搜索。The input unit 9032 is configured to input the word graph unit in the first search result into the second word graph model for search.
进一步的,参阅图15,所述输出模块904包括:排序单元9041、第二输出单元9042。其中:Further, referring to FIG. 15, the output module 904 includes: a sorting unit 9041, a second output unit 9042. among them:
排序单元9041,用于根据所述第二路径分数的高低对第二路径进行排序;The sorting unit 9041 is configured to sort the second path according to the score of the second path;
第二输出单元9042,用于按排序输出y个第二路径对应的语音识别结果,其中,y大于等于1。The second output unit 9042 is configured to output the voice recognition results corresponding to the y second paths in order, where y is greater than or equal to 1.
本申请实施例提供的一种语音识别装置能够实现图2至图8的方法实施例中的各个实施方式,以及相应有益效果,为避免重复,这里不再赘述。The voice recognition device provided in the embodiment of the present application can implement the various implementation manners in the method embodiments of FIG. 2 to FIG. 8 and the corresponding beneficial effects. To avoid repetition, details are not described herein again.
为解决上述技术问题,本申请实施例还提供计算机设备。具体请参阅图16,图16为本实施例计算机设备基本结构框图。In order to solve the above technical problems, the embodiments of the present application also provide computer equipment. Please refer to FIG. 16 for details. FIG. 16 is a block diagram of the basic structure of the computer device in this embodiment.
计算机设备16括通过系统总线相互通信连接非易失性存储器161、处理器162、网络接口163。需要指出的是,图中仅示出了具有组件161-163的计算机设备16,但是应理解的是,并不要求实施所有示出的组件,可以替代的实施更多或者更少的组件。其中,本技术领域技术人员可以理解,这里的计算机设备是一种能够按照事先设定或存储的指令,自动进行数值计算和/或信息处理的设备,其硬件包括但不限于微处理器、专用集成电路(Application Specific Integrated Circuit,ASIC)、可编程门阵列(Field-Programmable GateArray,FPGA)、数字处理器(Digital Signal Processor,DSP)、嵌入式设备等。The computer device 16 includes a non-volatile memory 161, a processor 162, and a network interface 163 that are communicatively connected to each other through a system bus. It should be pointed out that only the computer device 16 with components 161-163 is shown in the figure, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead. Among them, those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions. Its hardware includes, but is not limited to, a microprocessor, a dedicated Integrated Circuit (Application Specific Integrated Circuit, ASIC), Programmable Gate Array (Field-Programmable GateArray, FPGA), Digital Processor (Digital Signal Processor, DSP), embedded equipment, etc.
计算机设备可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。计算机设备可以与用户通过键盘、鼠标、遥控器、触摸板或声控设备等方式进行人机交互。The computer equipment can be computing equipment such as desktop computers, notebooks, palmtop computers, and cloud servers. The computer device can interact with the user through a keyboard, mouse, remote control, touch panel, or voice control device.
非易失性存储器161至少包括一种类型的可读非易失性存储介质,可读非易失性存储介 质包括闪存、硬盘、多媒体卡、卡型非易失性存储器(例如,SD或DX非易失性存储器等)、只读非易失性存储器(ROM)、电可擦除可编程只读非易失性存储器(EEPROM)、可编程只读非易失性存储器(PROM)、磁性非易失性存储器、磁盘、光盘等。在一些实施例中,非易失性存储器161可以是计算机设备16的内部存储单元,例如该计算机设备16的硬盘或内存。在另一些实施例中,非易失性存储器161也可以是计算机设备16的外部存储设备,例如该计算机设备16上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。当然,非易失性存储器161还可以既包括计算机设备16的内部存储单元也包括其外部存储设备。本实施例中,非易失性存储器161通常用于存储安装于计算机设备16的操作系统和各类应用软件,例如一种语音识别方法的可执行代码等。此外,非易失性存储器161还可以用于暂时地存储已经输出或者将要输出的各类数据。The non-volatile memory 161 includes at least one type of readable non-volatile storage medium. The readable non-volatile storage medium includes flash memory, hard disk, multimedia card, card-type non-volatile memory (for example, SD or DX). Non-volatile memory, etc.), read-only non-volatile memory (ROM), electrically erasable programmable read-only non-volatile memory (EEPROM), programmable read-only non-volatile memory (PROM), magnetic Non-volatile memory, magnetic disks, optical disks, etc. In some embodiments, the non-volatile memory 161 may be an internal storage unit of the computer device 16, such as a hard disk or memory of the computer device 16. In other embodiments, the non-volatile memory 161 may also be an external storage device of the computer device 16, such as a plug-in hard disk, a smart media card (SMC), and a secure digital device equipped on the computer device 16. (Secure Digital, SD) card, flash card (Flash Card), etc. Of course, the non-volatile memory 161 may also include both the internal storage unit of the computer device 16 and its external storage device. In this embodiment, the non-volatile memory 161 is generally used to store an operating system and various application software installed in the computer device 16, such as executable code of a voice recognition method. In addition, the non-volatile memory 161 can also be used to temporarily store various types of data that have been output or will be output.
处理器162在一些实施例中可以是中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。该处理器162通常用于控制计算机设备16的总体操作。本实施例中,处理器162用于运行非易失性存储器161中存储的可执行代码或者处理数据,例如运行一种语音识别方法的可执行代码。In some embodiments, the processor 162 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other data processing chips. The processor 162 is generally used to control the overall operation of the computer device 16. In this embodiment, the processor 162 is configured to run executable codes or process data stored in the non-volatile memory 161, for example, run executable codes for a voice recognition method.
网络接口163可包括无线网络接口或有线网络接口,该网络接口163通常用于在计算机设备16与其他电子设备之间建立通信连接。The network interface 163 may include a wireless network interface or a wired network interface, and the network interface 163 is generally used to establish a communication connection between the computer device 16 and other electronic devices.
本申请还提供了另一种实施方式,即提供一种计算机可读非易失性存储介质,计算机可读非易失性存储介质存储有一种语音识别可执行代码,上述一种语音识别可执行代码可被至少一个处理器执行,以使至少一个处理器执行如上述的一种语音识别方法的步骤。This application also provides another implementation manner, that is, a computer-readable non-volatile storage medium is provided. The computer-readable non-volatile storage medium stores a type of speech recognition executable code, and the above-mentioned type of speech recognition executable code is The code may be executed by at least one processor, so that the at least one processor executes the steps of a speech recognition method as described above.
通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个非易失性存储介质(如ROM、磁碟、光盘)中，包括若干指令用以使得一台终端设备(可以是手机，计算机，服务器，空调器，或者网络设备等)执行本申请各个实施例的一种语音识别方法。Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a non-volatile storage medium (such as a ROM, a magnetic disk, or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the speech recognition method of each embodiment of this application.
显然,以上所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例,附图中给出了本申请的较佳实施例,但并不限制本申请的专利范围。本申请可以以许多不同的形式来实现,相反地,提供这些实施例的目的是使对本申请的公开内容的理解更加透彻全面。尽管参照前述实施例对本申请进行了详细的说明,对于本领域的技术人员来而言,其依然可以对前述各具体实施方式所记载的技术方案进行修改,或者对其中部分技术特征进行等效替换。凡是利用本申请说明书及附图内容所做的等效结构,直接或间接运用在其他相关的技术领域,均同理在本申请专利保护范围之内。Obviously, the embodiments described above are only a part of the embodiments of the present application, rather than all of the embodiments. The drawings show preferred embodiments of the present application, but do not limit the patent scope of the present application. This application can be implemented in many different forms. On the contrary, the purpose of providing these examples is to make the understanding of the disclosure of this application more thorough and comprehensive. Although this application has been described in detail with reference to the foregoing embodiments, for those skilled in the art, it is still possible for those skilled in the art to modify the technical solutions described in each of the foregoing specific embodiments, or equivalently replace some of the technical features. . All equivalent structures made using the contents of the description and drawings of this application, directly or indirectly used in other related technical fields, are similarly within the scope of patent protection of this application.

Claims (19)

  1. 一种语音识别方法,其特征在于,包括下述步骤:A speech recognition method, characterized in that it comprises the following steps:
    获取待识别语音信息;Obtain the voice information to be recognized;
    将所述待识别语音信息输入本地的第一词图模型中进行解码搜索,得到第一搜索结果,所述第一搜索结果包括第一路径以及对应的第一路径分数,所述第一词图模型包括声学模型、发音词典及第一词图空间;Input the to-be-recognized speech information into the local first word graph model for decoding and search to obtain a first search result. The first search result includes a first path and a corresponding first path score. The first word graph The model includes acoustic model, pronunciation dictionary and first word image space;
    将所述第一搜索结果输入本地的第二词图模型中进行搜索,得到第二搜索结果,所述第二搜索结果包括第二路径以及对应第二路径分数,其中,所述第二词图模型包括第二词图空间,所述第一词图空间为第二词图空间的子词图空间;The first search result is input into the local second word graph model for searching, and the second search result is obtained. The second search result includes a second path and a corresponding second path score, wherein the second word graph The model includes a second word graph space, and the first word graph space is a sub-word graph space of the second word graph space;
    根据所述第二搜索结果中第二路径分数选择对应的第二路径进行输出,得到语音识别结果。The corresponding second path is selected and output according to the second path score in the second search result, and the speech recognition result is obtained.
  2. 根据权利要求1所述的语音识别方法,其特征在于,所述第一词图模型为配置在本地的至少一个第一词图模型,所述第一词图模型对应训练有语境属性,在所述将所述待识别语音信息输入本地的第一词图模型中进行解码搜索的步骤之前,所述方法还包括:The speech recognition method according to claim 1, wherein the first word graph model is at least one first word graph model configured locally, and the first word graph model is correspondingly trained with context attributes, Before the step of inputting the voice information to be recognized into the first local word graph model for decoding and searching, the method further includes:
    获取用户当前的语境信息;Get the user's current context information;
    根据用户当前的语境信息选择对应的第一词图模型对语音信息进行解码搜索。According to the user's current context information, the corresponding first word graph model is selected to decode and search the voice information.
  3. 根据权利要求1所述的语音识别方法,其特征在于,所述第一搜索结果包括至少一个第一路径的路径结果,所述将所述待识别语音信息输入本地的第一词图模型中进行解码搜索,得到第一搜索结果的步骤包括:The speech recognition method according to claim 1, wherein the first search result includes at least one path result of a first path, and the input of the speech information to be recognized into a local first word graph model is performed. The steps of decoding the search to obtain the first search result include:
    通过解码搜索获取第一路径的路径结果以及对应的第一路径分数;Obtain the path result of the first path and the corresponding first path score through decoding search;
    根据所述第一路径分数由高到低依次选取n个路径结果中的m个路径结果进行输出,得到第一搜索结果,其中,m小于等于n。According to the first path score from high to low, m path results from n path results are selected and outputted to obtain the first search result, where m is less than or equal to n.
  4. 根据权利要求1所述的语音识别方法,其特征在于,所述第一词图模型,其构建方法包括以下步骤:The speech recognition method according to claim 1, wherein the method for constructing the first word graph model comprises the following steps:
    从预先构建好的第二词图空间中提取出词图单元,并根据所述词图单元构建第一词图空间;Extracting the word graph unit from the pre-built second word graph space, and constructing the first word graph space according to the word graph unit;
    根据声学模型、发音词典、第一词图空间对所述第一词图模型进行构建。The first word graph model is constructed according to the acoustic model, pronunciation dictionary, and first word graph space.
  5. 根据权利要求4所述的语音识别方法,其特征在于,所述第一词图模型的构建还包括以下步骤:The speech recognition method according to claim 4, wherein the construction of the first word graph model further comprises the following steps:
    对所述第一词图模型进行训练,训练至损失函数拟合,得到所述第一词图空间中词图单元的权重。The first word graph model is trained to fit the loss function, and the weight of the word graph unit in the first word graph space is obtained.
  6. 根据权利要求4所述的语音识别方法,其特征在于,所述将所述第一搜索结果输入本地的第二词图模型中进行搜索的步骤包括:The speech recognition method according to claim 4, wherein the step of inputting the first search result into a local second word graph model for searching comprises:
    提取第一搜索结果中的词图单元;Extract the word map unit in the first search result;
    将所述第一搜索结果中的词图单元输入到第二词图模型中进行搜索。Input the word graph unit in the first search result into the second word graph model for searching.
  7. 根据权利要求1至6中任一所述的语音识别方法,其特征在于,所述根据所述第二路径分数选择对应的第二路径进行输出,得到语音识别结果的步骤具体包括:The speech recognition method according to any one of claims 1 to 6, wherein the step of selecting a corresponding second path for output according to the second path score, and obtaining a speech recognition result specifically includes:
    根据所述第二路径分数的高低对第二路径进行排序;Sort the second path according to the score of the second path;
    按排序输出y个第二路径对应的语音识别结果,其中,y大于等于1。Output the speech recognition results corresponding to y second paths in order, where y is greater than or equal to 1.
  8. 一种语音识别装置,其特征在于,包括:A speech recognition device is characterized in that it comprises:
    获取模块,用于获取待识别语音信息;The acquiring module is used to acquire the voice information to be recognized;
    第一搜索模块,用于将所述待识别语音信息输入本地的第一词图模型中进行解码搜索,得到第一搜索结果,第一搜索结果包括第一路径以及对应的第一路径分数,第一词图模型包括声学模型、发音词典及第一词图空间;The first search module is used to input the to-be-recognized speech information into the local first word graph model for decoding search, and obtain the first search result. The first search result includes the first path and the corresponding first path score. A word graph model includes acoustic model, pronunciation dictionary and first word graph space;
    第二搜索模块,用于将第一搜索结果输入本地的第二词图模型中进行搜索,得到第二搜索结果,第二搜索结果包括第二路径以及对应第二路径分数,其中,第二词图模型包括第二词图空间,第一词图空间为第二词图空间的子词图空间;The second search module is used to input the first search result into the local second word graph model for searching, and obtain the second search result. The second search result includes the second path and the corresponding second path score, where the second word The graph model includes a second word graph space, and the first word graph space is a sub-word graph space of the second word graph space;
    输出模块,用于根据第二路径分数选择对应的第二路径进行输出,得到语音识别结果。The output module is used to select the corresponding second path for output according to the second path score to obtain the voice recognition result.
  9. 根据权利要求8所述的语音识别装置,其特征在于,还包括:The speech recognition device according to claim 8, further comprising:
    第二获取模块,用于获取用户当前的语境信息;The second acquisition module is used to acquire the current context information of the user;
    选择模块,用于根据用户当前的语境信息选择对应的第一词图模型对语音信息进行解码搜索。The selection module is used to select the corresponding first word graph model to decode and search the voice information according to the user's current context information.
  10. 根据权利要求8所述的语音识别装置,其特征在于,还包括:The speech recognition device according to claim 8, further comprising:
    第一提取单元,用于从预先构建好的第二词图空间中提取出词图单元,并根据所述词图单元构建第一词图空间;The first extraction unit is configured to extract the word graph unit from the second word graph space constructed in advance, and construct the first word graph space according to the word graph unit;
    构建单元,用于根据声学模型、发音词典、第一词图空间对所述第一词图模型进行构建;The construction unit is used to construct the first word graph model according to the acoustic model, pronunciation dictionary, and first word graph space;
    训练单元,对所述第一词图模型进行训练,训练至损失函数拟合,得到所述第一词图空间中词图单元的权重。The training unit trains the first word graph model, trains it to fit the loss function, and obtains the weight of the word graph unit in the first word graph space.
  11. 一种计算机设备,包括非易失性存储器和处理器,所述非易失性存储器中存储有计算机可读指令,其特征在于:所述处理器执行所述计算机可读指令时实现以下步骤:A computer device includes a non-volatile memory and a processor. The non-volatile memory stores computer-readable instructions, and is characterized in that: when the processor executes the computer-readable instructions, the following steps are implemented:
    获取待识别语音信息;Obtain the voice information to be recognized;
    将所述待识别语音信息输入本地的第一词图模型中进行解码搜索,得到第一搜索结果,所述第一搜索结果包括第一路径以及对应的第一路径分数,所述第一词图模型包括声学模型、发音词典及第一词图空间;Input the to-be-recognized speech information into the local first word graph model for decoding and search to obtain a first search result. The first search result includes a first path and a corresponding first path score. The first word graph The model includes acoustic model, pronunciation dictionary and first word image space;
    将所述第一搜索结果输入本地的第二词图模型中进行搜索,得到第二搜索结果,所述第二搜索结果包括第二路径以及对应第二路径分数,其中,所述第二词图模型包括第二词图空间,所述第一词图空间为第二词图空间的子词图空间;The first search result is input into the local second word graph model for searching, and the second search result is obtained. The second search result includes a second path and a corresponding second path score, wherein the second word graph The model includes a second word graph space, and the first word graph space is a sub-word graph space of the second word graph space;
    根据所述第二搜索结果中第二路径分数选择对应的第二路径进行输出,得到语音识别结果。The corresponding second path is selected and output according to the second path score in the second search result, and the speech recognition result is obtained.
  12. 根据权利要求11所述的一种计算机设备,其特征在于,所述第一词图模型为配置在本地的至少一个第一词图模型,所述第一词图模型对应训练有语境属性,在所述将所述待识别语音信息输入本地的第一词图模型中进行解码搜索的步骤之前,所述处理器执行所述计算机可读指令时还包括以下步骤:The computer device according to claim 11, wherein the first word graph model is at least one first word graph model configured locally, and the first word graph model is correspondingly trained with context attributes, Before the step of inputting the voice information to be recognized into the first local word graph model for decoding and searching, the processor further includes the following steps when executing the computer-readable instruction:
    获取用户当前的语境信息;Get the user's current context information;
    根据用户当前的语境信息选择对应的第一词图模型对语音信息进行解码搜索。According to the user's current context information, the corresponding first word graph model is selected to decode and search the voice information.
  13. 根据权利要求11所述的一种计算机设备,其特征在于,所述第一搜索结果包括至少一个第一路径的路径结果,所述将所述待识别语音信息输入本地的第一词图模型中进行解码搜索,得到第一搜索结果的步骤包括:The computer device according to claim 11, wherein the first search result includes at least one path result of a first path, and the voice information to be recognized is input into a local first word graph model The steps of performing a decoding search to obtain the first search result include:
    通过解码搜索获取第一路径的路径结果以及对应的第一路径分数;Obtain the path result of the first path and the corresponding first path score through decoding search;
    根据所述第一路径分数由高到低依次选取n个路径结果中的m个路径结果进行输出,得到第一搜索结果,其中,m小于等于n。According to the first path score from high to low, m path results from n path results are selected and outputted to obtain the first search result, where m is less than or equal to n.
  14. 根据权利要求11所述的一种计算机设备,其特征在于,所述第一词图模型的构建包括以下步骤:The computer device according to claim 11, wherein the construction of the first word graph model comprises the following steps:
    从预先构建好的第二词图空间中提取出词图单元,并根据所述词图单元构建第一词图空间;Extracting the word graph unit from the pre-built second word graph space, and constructing the first word graph space according to the word graph unit;
    根据声学模型、发音词典、第一词图空间对所述第一词图模型进行构建。The first word graph model is constructed according to the acoustic model, pronunciation dictionary, and first word graph space.
    对所述第一词图模型进行训练,训练至损失函数拟合,得到所述第一词图空间中词图单元的权重。The first word graph model is trained to fit the loss function, and the weight of the word graph unit in the first word graph space is obtained.
  15. 根据权利要求11所述的一种计算机设备,其特征在于,所述将所述第一搜索结果输入本地的第二词图模型中进行搜索的步骤包括:The computer device according to claim 11, wherein the step of inputting the first search result into a local second word graph model for searching comprises:
    提取第一搜索结果中的词图单元;Extract the word map unit in the first search result;
    将所述第一搜索结果中的词图单元输入到第二词图模型中进行搜索。Input the word graph unit in the first search result into the second word graph model for searching.
  16. 一种非易失性的计算机可读存储介质，其特征在于，所述计算机可读非易失性存储介质上存储有计算机可读指令，所述计算机可读指令被处理器执行时实现如下语音识别方法的步骤：A non-volatile computer-readable storage medium, wherein the computer-readable non-volatile storage medium stores computer-readable instructions which, when executed by a processor, implement the steps of the following speech recognition method:
    获取待识别语音信息;Obtain the voice information to be recognized;
    将所述待识别语音信息输入本地的第一词图模型中进行解码搜索,得到第一搜索结果,所述第一搜索结果包括第一路径以及对应的第一路径分数,所述第一词图模型包括声学模型、发音词典及第一词图空间;Input the to-be-recognized speech information into the local first word graph model for decoding and search to obtain a first search result. The first search result includes a first path and a corresponding first path score. The first word graph The model includes acoustic model, pronunciation dictionary and first word image space;
    将所述第一搜索结果输入本地的第二词图模型中进行搜索,得到第二搜索结果,所述第二搜索结果包括第二路径以及对应第二路径分数,其中,所述第二词图模型包括第二词图空间,所述第一词图空间为第二词图空间的子词图空间;The first search result is input into the local second word graph model for searching, and the second search result is obtained. The second search result includes a second path and a corresponding second path score, wherein the second word graph The model includes a second word graph space, and the first word graph space is a sub-word graph space of the second word graph space;
    根据所述第二搜索结果中第二路径分数选择对应的第二路径进行输出,得到语音识别结果。The corresponding second path is selected and output according to the second path score in the second search result, and the speech recognition result is obtained.
  17. 根据权利要求16所述的一种计算机可读非易失性存储介质，其特征在于，所述第一词图模型为配置在本地的至少一个第一词图模型，所述第一词图模型对应训练有语境属性，在所述将所述待识别语音信息输入本地的第一词图模型中进行解码搜索的步骤之前，所述计算机可读指令被处理器执行时还实现如下步骤：The computer-readable non-volatile storage medium according to claim 16, wherein the first word graph model is at least one first word graph model configured locally and is correspondingly trained with context attributes, and before the step of inputting the to-be-recognized speech information into the local first word graph model for decoding and searching, the computer-readable instructions, when executed by the processor, further implement the following steps:
    获取用户当前的语境信息;Get the user's current context information;
    根据用户当前的语境信息选择对应的第一词图模型对语音信息进行解码搜索。According to the user's current context information, the corresponding first word graph model is selected to decode and search the voice information.
  18. 根据权利要求16所述的一种计算机可读非易失性存储介质，其特征在于，所述第一搜索结果包括至少一个第一路径的路径结果，所述将所述待识别语音信息输入本地的第一词图模型中进行解码搜索，得到第一搜索结果的步骤包括：The computer-readable non-volatile storage medium according to claim 16, wherein the first search result includes a path result of at least one first path, and the step of inputting the to-be-recognized speech information into the local first word graph model for decoding search to obtain the first search result includes:
    通过解码搜索获取第一路径的路径结果以及对应的第一路径分数;Obtain the path result of the first path and the corresponding first path score through decoding search;
    根据所述第一路径分数由高到低依次选取n个路径结果中的m个路径结果进行输出,得到第一搜索结果,其中,m小于等于n。According to the first path score from high to low, m path results from n path results are selected and outputted to obtain the first search result, where m is less than or equal to n.
  19. 根据权利要求16所述的一种计算机可读非易失性存储介质，其特征在于，所述第一词图模型的构建包括以下步骤：The computer-readable non-volatile storage medium according to claim 16, wherein the construction of the first word graph model comprises the following steps:
    从预先构建好的第二词图空间中提取出词图单元,并根据所述词图单元构建第一词图空间;Extracting the word graph unit from the pre-built second word graph space, and constructing the first word graph space according to the word graph unit;
    根据声学模型、发音词典、第一词图空间对所述第一词图模型进行构建;Constructing the first word graph model according to the acoustic model, the pronunciation dictionary, and the first word graph space;
    对所述第一词图模型进行训练,训练至损失函数拟合,得到所述第一词图空间中词图单元的权重。The first word graph model is trained to fit the loss function, and the weight of the word graph unit in the first word graph space is obtained.
  20. 根据权利要求19所述的一种计算机可读非易失性存储介质，所述将所述第一搜索结果输入本地的第二词图模型中进行搜索的具体包括：The computer-readable non-volatile storage medium according to claim 19, wherein said inputting the first search result into the local second word graph model for searching specifically comprises:
    提取第一搜索结果中的词图单元;Extract the word map unit in the first search result;
    将所述第一搜索结果中的词图单元输入到第二词图模型中进行搜索。Input the word graph unit in the first search result into the second word graph model for searching.
PCT/CN2019/116920 2019-09-20 2019-11-10 Speech identification method and apparatus, computer device and non-volatile storage medium WO2021051514A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910894996.8 2019-09-20
CN201910894996.8A CN110808032B (en) 2019-09-20 2019-09-20 Voice recognition method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021051514A1 true WO2021051514A1 (en) 2021-03-25

Family

ID=69487614

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116920 WO2021051514A1 (en) 2019-09-20 2019-11-10 Speech identification method and apparatus, computer device and non-volatile storage medium

Country Status (2)

Country Link
CN (1) CN110808032B (en)
WO (1) WO2021051514A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111341305B (en) * 2020-03-05 2023-09-26 苏宁云计算有限公司 Audio data labeling method, device and system
CN111681661B (en) * 2020-06-08 2023-08-08 北京有竹居网络技术有限公司 Speech recognition method, apparatus, electronic device and computer readable medium
CN111916058A (en) * 2020-06-24 2020-11-10 西安交通大学 Voice recognition method and system based on incremental word graph re-scoring
CN112560496B (en) * 2020-12-09 2024-02-02 北京百度网讯科技有限公司 Training method and device of semantic analysis model, electronic equipment and storage medium
CN113223495B (en) * 2021-04-25 2022-08-26 北京三快在线科技有限公司 Abnormity detection method and device based on voice recognition
CN112863489B (en) * 2021-04-26 2021-07-27 腾讯科技(深圳)有限公司 Speech recognition method, apparatus, device and medium
CN113643706B (en) * 2021-07-14 2023-09-26 深圳市声扬科技有限公司 Speech recognition method, device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102592595A (en) * 2012-03-19 2012-07-18 安徽科大讯飞信息科技股份有限公司 Voice recognition method and system
CN107195296A (en) * 2016-03-15 2017-09-22 阿里巴巴集团控股有限公司 A kind of audio recognition method, device, terminal and system
CN108305634A (en) * 2018-01-09 2018-07-20 深圳市腾讯计算机系统有限公司 Coding/decoding method, decoder and storage medium
US10032451B1 (en) * 2016-12-20 2018-07-24 Amazon Technologies, Inc. User recognition for speech processing systems
CN108510990A (en) * 2018-07-04 2018-09-07 百度在线网络技术(北京)有限公司 Audio recognition method, device, user equipment and storage medium
CN109036391A (en) * 2018-06-26 2018-12-18 华为技术有限公司 Audio recognition method, apparatus and system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100612839B1 (en) * 2004-02-18 2006-08-18 삼성전자주식회사 Method and apparatus for domain-based dialog speech recognition
KR20140028174A (en) * 2012-07-13 2014-03-10 삼성전자주식회사 Method for recognizing speech and electronic device thereof
CN106856092B (en) * 2015-12-09 2019-11-15 中国科学院声学研究所 Chinese speech keyword retrieval method based on feedforward neural network language model
CN106328147B (en) * 2016-08-31 2022-02-01 中国科学技术大学 Speech recognition method and device
CN106782513B (en) * 2017-01-25 2019-08-23 上海交通大学 Speech recognition realization method and system based on confidence level
CN110070859B (en) * 2018-01-23 2023-07-14 阿里巴巴集团控股有限公司 Voice recognition method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102592595A (en) * 2012-03-19 2012-07-18 安徽科大讯飞信息科技股份有限公司 Voice recognition method and system
CN107195296A (en) * 2016-03-15 2017-09-22 阿里巴巴集团控股有限公司 A kind of audio recognition method, device, terminal and system
US10032451B1 (en) * 2016-12-20 2018-07-24 Amazon Technologies, Inc. User recognition for speech processing systems
CN108305634A (en) * 2018-01-09 2018-07-20 深圳市腾讯计算机系统有限公司 Coding/decoding method, decoder and storage medium
CN109036391A (en) * 2018-06-26 2018-12-18 华为技术有限公司 Audio recognition method, apparatus and system
CN108510990A (en) * 2018-07-04 2018-09-07 百度在线网络技术(北京)有限公司 Audio recognition method, device, user equipment and storage medium

Also Published As

Publication number Publication date
CN110808032B (en) 2023-12-22
CN110808032A (en) 2020-02-18

Similar Documents

Publication Publication Date Title
WO2021051514A1 (en) Speech identification method and apparatus, computer device and non-volatile storage medium
WO2021232725A1 (en) Voice interaction-based information verification method and apparatus, and device and computer storage medium
US10614803B2 (en) Wake-on-voice method, terminal and storage medium
EP3832519A1 (en) Method and apparatus for evaluating translation quality
CN107430859B (en) Mapping input to form fields
US8620658B2 (en) Voice chat system, information processing apparatus, speech recognition method, keyword data electrode detection method, and program for speech recognition
WO2021139108A1 (en) Intelligent emotion recognition method and apparatus, electronic device, and storage medium
US11217236B2 (en) Method and apparatus for extracting information
WO2020001458A1 (en) Speech recognition method, device, and system
US10290299B2 (en) Speech recognition using a foreign word grammar
WO2021135438A1 (en) Multilingual speech recognition model training method, apparatus, device, and storage medium
EP3405912A1 (en) Analyzing textual data
US20140372119A1 (en) Compounded Text Segmentation
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
JPWO2005101235A1 (en) Dialogue support device
WO2020238045A1 (en) Intelligent speech recognition method and apparatus, and computer-readable storage medium
WO2021218028A1 (en) Artificial intelligence-based interview content refining method, apparatus and device, and medium
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
CN111126084B (en) Data processing method, device, electronic equipment and storage medium
US20230004798A1 (en) Intent recognition model training and intent recognition method and apparatus
CN112489634A (en) Language acoustic model training method and device, electronic equipment and computer medium
CN110503956B (en) Voice recognition method, device, medium and electronic equipment
CN114706973A (en) Extraction type text abstract generation method and device, computer equipment and storage medium
CN113850291A (en) Text processing and model training method, device, equipment and storage medium
CN110852075B (en) Voice transcription method and device capable of automatically adding punctuation marks and readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19945487

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19945487

Country of ref document: EP

Kind code of ref document: A1