CN115104151A - Offline voice recognition method and device, electronic equipment and readable storage medium - Google Patents


Info

Publication number
CN115104151A
CN115104151A
Authority
CN
China
Prior art keywords: text data, information, voice signal, target, preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202080003684.4A
Other languages
Chinese (zh)
Inventor
郝吉芳 (Hao Jifang)
宿绍勋 (Su Shaoxun)
王炳乾 (Wang Bingqian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd filed Critical BOE Technology Group Co Ltd
Publication of CN115104151A

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue

Abstract

An offline speech recognition method and apparatus (400), an electronic device, and a readable storage medium. The offline speech recognition method includes: acquiring a voice signal and converting the voice signal into text data (101); identifying a target intention of the text data (102); extracting key information associated with the target intention from the text data, the key information matching one of a plurality of pieces of preset information (103); and determining a control instruction corresponding to the voice signal according to the key information and the target intention (104). By acquiring the target intention of the voice signal and the key information corresponding to that intention, the control instruction for the voice signal is determined without depending on a background server, so that offline devices that are not networked can also recognize speech, widening the application range of speech recognition.

Description

Offline voice recognition method and device, electronic equipment and readable storage medium

Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to an offline speech recognition method and apparatus, an electronic device, and a readable storage medium.
Background
Speech recognition refers to the process of analyzing an input speech signal to obtain the meaning the speech signal expresses. In the related art, speech recognition relies on a network: the electronic device needs to be communicatively connected to a background server through the network, so that the speech recognition function is realized by the background server.
Disclosure of Invention
In a first aspect, an embodiment of the present disclosure provides an offline speech recognition method, including the following steps:
acquiring a voice signal, and converting the voice signal into text data;
identifying a target intent of the text data;
extracting key information associated with the target intention in the text data, wherein the key information is matched with one of a plurality of preset information;
and determining a control instruction corresponding to the voice signal according to the key information and the target intention.
Optionally, the identifying the target intention of the text data includes:
converting the text data into a digital vector through a pre-trained conversion model;
identifying semantic information corresponding to the digital vector;
determining a matching degree between the semantic information and a plurality of preset intents;
and taking the preset intention with the highest matching degree with the semantic information as a target intention corresponding to the text data.
Optionally, the preset intention includes at least one of network connection control, power-off control, volume adjustment, brightness adjustment, and signal source adjustment.
Optionally, the extracting key information associated with the target intention in the text data includes:
determining the preset information matched with the target intention in the plurality of preset information according to the target intention;
marking a plurality of vocabularies included in the text data, and determining the matching degree of each vocabulary and each preset information;
taking the vocabulary with the highest matching degree with the preset information as a target vocabulary containing the key information;
and acquiring information included in the target vocabulary as the key information.
Optionally, the acquiring a voice signal and converting the voice signal into text data includes:
acquiring an input voice signal;
carrying out noise reduction processing on the voice signal to obtain a first signal;
converting the first signal into a first text through a pre-trained text conversion model;
and correcting abnormal data existing in the first text to obtain text data corresponding to the voice signal.
In a second aspect, an embodiment of the present disclosure provides an offline speech recognition apparatus, including:
the acquisition conversion module is used for acquiring a voice signal and converting the voice signal into text data;
an intention recognition module for recognizing a target intention of the text data;
the key information extraction module is used for extracting key information which is associated with the target intention in the text data, and the key information is matched with one of a plurality of preset information;
and the control instruction determining module is used for determining a control instruction corresponding to the voice signal according to the key information and the target intention.
Optionally, the intention identifying module includes:
the vector conversion submodule is used for converting the text data into a digital vector through a pre-trained conversion model;
the semantic information identifying submodule is used for identifying semantic information corresponding to the digital vector;
the intention matching submodule is used for determining the matching degree between the semantic information and a plurality of preset intentions;
and the intention determining submodule is used for taking the preset intention with the highest matching degree with the semantic information as the target intention corresponding to the text data.
Optionally, the preset intention includes at least one of network connection control, power-off control, volume adjustment, brightness adjustment, and signal source adjustment.
Optionally, the key information extracting module includes:
the preset information determining submodule is used for determining the preset information which is correspondingly matched with the target intention in the plurality of preset information according to the target intention;
the marking submodule is used for marking a plurality of vocabularies included in the text data and determining the matching degree of each vocabulary and each preset information;
the target vocabulary determining submodule is used for taking the vocabulary with the highest matching degree with the preset information as the target vocabulary containing the key information;
and the key information acquisition submodule is used for acquiring information included in the target vocabulary as the key information.
Optionally, the obtaining and converting module includes:
the acquisition submodule is used for acquiring an input voice signal;
the noise reduction submodule is used for carrying out noise reduction processing on the voice signal to obtain a first signal;
the text conversion sub-module is used for converting the first signal into a first text through a pre-trained text conversion model;
and the correction submodule is used for correcting abnormal data existing in the first text to obtain text data corresponding to the voice signal.
In a third aspect, the disclosed embodiments provide an electronic device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the offline speech recognition method according to any one of the first aspect.
In a fourth aspect, the present disclosure provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the offline speech recognition method according to any one of the first aspect.
The embodiment of the disclosure converts a voice signal into text data by acquiring the voice signal; identifying a target intent of the text data; extracting key information associated with the target intention from the text data; and determining a control instruction corresponding to the voice signal according to the key information and the target intention. Therefore, the target intention of the voice signal is obtained, the key information corresponding to the target intention is obtained, the control instruction of the voice signal is determined, and the voice signal can be recognized without depending on a background server, so that offline equipment which is not networked can also recognize the voice signal, and the application range of the voice recognition is widened.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings needed to be used in the description of the embodiments of the present disclosure will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and for those skilled in the art, other drawings may be obtained according to these drawings without inventive exercise.
FIG. 1 is a flow chart of an offline speech recognition method provided by an embodiment of the present disclosure;
fig. 2 is a schematic view of a scene of an offline speech recognition method according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of an offline speech recognition method according to an embodiment of the present disclosure;
fig. 4 is a block diagram of an offline speech recognition apparatus according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
The embodiment of the disclosure provides an off-line voice recognition method.
The technical solution of this embodiment is applied to an electronic device. It should be noted that offline voice recognition in this embodiment refers to voice recognition that does not depend on network resources. The electronic device may be in an offline state or an online state. The offline state means that the electronic device has no data connection with external devices through a wireless hotspot, a mobile data network, or other means; the online state means that the electronic device is communicatively connected with other devices through a wireless hotspot, a mobile data network, or other means.
In this embodiment, the offline voice recognition process does not depend on data external to the electronic device. It can be understood that the voice recognition process of the embodiments of the present disclosure can be implemented regardless of whether the electronic device is in an offline state or an online state.
As shown in fig. 1, in one embodiment, the offline speech recognition method includes the steps of:
step 101: acquiring a voice signal, and converting the voice signal into text data.
As shown in fig. 2, the voice signal in this embodiment refers to a voice signal input by a user to an electronic device, and in implementation, the input voice signal may be collected by a remote controller with a voice collection function, a microphone, or a voice collection device carried by the electronic device.
After the speech signal is collected, it is further converted into text.
In one embodiment, the step 101 specifically includes:
acquiring an input voice signal;
carrying out noise reduction processing on the voice signal to obtain a first signal;
converting the first signal into a first text through a pre-trained text conversion model;
and correcting abnormal data existing in the first text to obtain text data corresponding to the voice signal.
As shown in fig. 3, after the input speech signal is acquired, noise reduction processing is performed on the speech signal. The noise reduction processing removes noise, which specifically includes external noise and internal noise. External noise refers to noise from outside the electronic device, such as environmental noise; internal noise refers to music played by the electronic device itself, noise generated by applications running on it, and the like. External noise reduction can be implemented by filtering, spectral subtraction, Wiener filtering, deep-learning-based noise reduction, and other methods, while internal noise can be removed by performing echo cancellation corresponding to the sound played by the electronic device.
After the noise reduction processing, a first signal of relatively high quality can be obtained.
Next, the first signal is converted into a first text. In this embodiment, the process of speech recognition mainly includes extracting features of speech, and establishing a speech template required for speech recognition based on the features.
In the recognition process, the established voice template is compared with the characteristics of the input first signal by using a text conversion model for voice recognition, and the voice template with the highest matching degree with the first signal is found out according to a certain search and matching strategy. The recognition result for the first signal can then be given by a table look-up according to the definition of this template.
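As an illustration of the search-and-match strategy described above, the following sketch picks the stored speech template whose feature vector is closest to the input. The representation (plain feature lists) and the squared-Euclidean distance are assumptions for illustration, not details from the patent:

```python
def match_template(features, templates):
    """Return the name of the speech template whose feature vector is
    closest (squared Euclidean distance) to the input features.

    A toy stand-in for the search/match strategy in the text:
    `templates` maps a template name to its stored feature vector.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(templates, key=lambda name: sq_dist(features, templates[name]))
```

The recognition result would then be looked up from the definition of the best-matching template, as the text describes.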
The training of the text conversion model is completed in advance. In implementation, signal processing and knowledge mining can be performed on pre-collected speech and language databases to obtain the acoustic model and language model required by the speech recognition system, yielding a text conversion model that meets the usage requirements; the text conversion model is then deployed in the electronic device.
In the application process, the text conversion model is utilized to identify the user input signal. It should be noted that the user input signal may refer to the speech signal or the first signal subjected to the noise reduction processing.
It is to be understood that the process of converting a speech signal into a first text may be understood as including two main processes of noise reduction processing and text recognition.
The noise reduction processing mainly implements endpoint detection (to remove redundant silence and non-speech sound), noise reduction, feature extraction, and the like; the text recognition mainly uses the trained acoustic model and language model to perform statistical pattern recognition, also called decoding, on the feature vectors of the user's speech, thereby obtaining the text information they contain.
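The endpoint detection mentioned above can be illustrated with a minimal energy-based sketch; the frame representation (lists of samples) and the threshold value are assumptions for illustration:

```python
def trim_silence(frames, threshold=0.01):
    """Crude endpoint detection: drop leading and trailing frames whose
    mean absolute amplitude is below `threshold` (treated as silence)."""
    def energy(frame):
        return sum(abs(s) for s in frame) / len(frame)

    start, end = 0, len(frames)
    while start < end and energy(frames[start]) < threshold:
        start += 1
    while end > start and energy(frames[end - 1]) < threshold:
        end -= 1
    return frames[start:end]
```

A production system would typically combine energy with zero-crossing rate or a learned voice-activity detector, but the trimming logic is the same.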
In some embodiments, an adaptive feedback process may further follow the text recognition. This feedback process self-learns from the user's speech in order to apply necessary "corrections" to the acoustic model and the language model, further improving recognition accuracy.
After the first text is obtained, the text content may be further corrected, for example: correcting homophone errors; correcting words with similar pronunciations; correcting specific proper nouns against a lexicon (for example, correcting a misrecognized "Wudy Allen" to "Woody Allen"); correcting grammatical errors; completing truncated words; and fixing malformed characters. This process can be implemented based on specific rules or with a corresponding deep learning model, and obviously the rule set can be further expanded.
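A minimal sketch of the rule-based variant of this correction step follows; the lexicon entries here are illustrative stand-ins, not entries from the patent:

```python
# Hypothetical correction lexicon: misrecognized form -> corrected form.
CORRECTION_LEXICON = {
    "Wudy Allen": "Woody Allen",
    "imagenation": "imagination",
}

def correct_text(text, lexicon=CORRECTION_LEXICON):
    """Rule-based post-correction: replace each known misrecognition
    in `text` with its corrected form."""
    for wrong, right in lexicon.items():
        text = text.replace(wrong, right)
    return text
```

In practice a deep learning model could replace or supplement the rule lexicon, as the text notes.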
In this embodiment, the corrected first text is used as the text data corresponding to the input speech signal. In other embodiments, the noise reduction and text correction steps described above are optional and may be omitted as needed to reduce the system load during speech recognition.
Step 102: a target intent of the text data is identified.
As shown in fig. 3, after the text data is obtained, the target intention corresponding to the text data is identified. This process can be understood as classifying the text data, i.e., determining the meaning the text data expresses and the specific purpose it is intended to achieve.
In some of these embodiments, this step 102 includes:
converting the text data into a digital vector through a pre-trained conversion model;
identifying semantic information corresponding to the digital vector;
determining a degree of matching between the semantic information and a plurality of preset intentions;
and taking the preset intention with the highest matching degree with the semantic information as a target intention corresponding to the text data.
In this embodiment, the intent recognition process may be implemented based on a BERT model. The conversion model of the BERT framework generates word vectors through pre-training: text in natural language is converted into a digital vector, and the corresponding semantic information is then identified. This increases the generalization capability of the word-vector model and fully captures character-level, word-level, sentence-level, and even inter-sentence relationship features. Obviously, the intent recognition process may also be implemented by regular-expression matching, a BiLSTM-based similarity calculation model, and the like, which is not further limited here.
In some of these embodiments, intent recognition may be implemented by a softmax classifier. For example, a classification function y_i = softmax(W_i · h_1 + b_i) may be set, where y_i is the probability that the intention is classified into class i, W_i is a weight, h_1 is the input data, and b_i is a bias vector. The softmax algorithm itself is described in the related art and is not further defined here.
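The classification function above can be sketched in plain Python. The weights, biases, and feature vector in the usage example are toy values for illustration, not trained parameters:

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_intent(h1, weights, biases):
    """y_i = softmax(W_i . h_1 + b_i): one weight row and one bias per
    intent class; returns a probability per class."""
    logits = [sum(w * x for w, x in zip(row, h1)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)
```

For instance, `classify_intent([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0])` assigns the higher probability to class 0.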
After obtaining the semantic information corresponding to the digital vector, determining the matching degree between the semantic information and a plurality of preset intentions.
It should be understood that, since the technical solution of this embodiment is used for implementing offline speech recognition, and is limited by factors such as hardware performance, the computing capability that can be provided is limited, and therefore, in this embodiment, a certain number of preset intents are set, and the speech recognition and control functions are mainly provided for these preset intents.
In one embodiment, the electronic device may be a conference kiosk, a smart screen, a home appliance, or the like, as shown in fig. 3. The preset intents include at least one of a network connection control, a power-off control, a volume adjustment, a brightness adjustment, and a signal source adjustment.
More specifically, in one embodiment, only these five preset intentions are set. During voice recognition, the recognized semantic information is matched against the preset intentions, and the preset intention with the highest matching degree is selected as the target intention corresponding to the text data, which reduces the amount of computation and improves recognition accuracy.
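Selecting the preset intention with the highest matching degree can be sketched as follows; the score values in the example are illustrative, standing in for the classifier's output:

```python
PRESET_INTENTS = ("network connection control", "power-off control",
                  "volume adjustment", "brightness adjustment",
                  "signal source adjustment")

def pick_target_intent(match_scores):
    """Return the preset intention whose matching degree with the
    semantic information is highest. `match_scores` maps each preset
    intention to its score, e.g. a probability from a classifier."""
    return max(match_scores, key=match_scores.get)
```

Limiting the search to a small, fixed set of intentions is what keeps the computation feasible on offline hardware.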
Step 103: and extracting key information associated with the target intention in the text data, wherein the key information is matched with one of a plurality of preset information.
After the target intention is determined, the key information in the text data is extracted. In this embodiment, each preset intention is configured with one or more pieces of corresponding preset information; in implementation, the text data is searched to determine whether matching key information exists.
Illustratively, in one embodiment, the text data obtained from the voice signal is "adjust volume to 60". Intent recognition yields volume adjustment as the target intention of the voice signal. The preset information corresponding to volume adjustment includes four types: volume up, volume down, mute, and adjust to a specified volume. After the text data is obtained, it is checked for key information matching the preset information; here, "60" is recognized as matching "adjust to a specified volume" among the preset information, so "60" is taken as the corresponding key information.
In some embodiments, the step 103 specifically includes:
determining the preset information matched with the target intention in the plurality of preset information according to the target intention;
marking a plurality of vocabularies included in the text data, and determining the matching degree of each vocabulary and each preset information;
taking the vocabulary with the highest matching degree with the preset information as a target vocabulary containing the key information;
and acquiring information included in the target vocabulary as the key information.
In some embodiments, the key information may be obtained by slot filling. In this embodiment, after the target intention is determined, preset information that matches the target intention is determined from among a plurality of pieces of preset information.
Illustratively, the preset information corresponding to volume adjustment is volume increase, volume decrease, mute and adjustment to a specified volume, the preset information corresponding to brightness adjustment is brightness increase and brightness decrease, and when it is determined that the target intention is volume adjustment, the preset information matched with the intention is four preset information of volume increase, volume decrease, mute and adjustment to a specified volume.
Next, a plurality of words included in the text data are labeled. For example, for "adjust volume to 60", the labeled words may be "will", "volume", "adjust to", and "60" (following the original Chinese phrasing); in this process, some or all of the words in the text data may be labeled.
After the labeling of the vocabulary is completed, a degree of matching between the vocabulary and the preset information is determined. For example, in the present embodiment, the matching degrees of the four words "will", "volume", "adjust to", and "60" with the four pieces of preset information "volume up", "volume down", "mute", and "adjust to specified volume" are determined one by one, respectively.
In this embodiment, the matching degree between "60" and "adjusted to the specified volume" is the highest, and therefore, the word "60" is taken as the target word, and the information contained in "60" is further acquired as the specific volume value 60, and the information is taken as the key information.
Similar to the above process, the degree of matching of each vocabulary with the preset information can be calculated by methods including, but not limited to, the softmax algorithm described above.
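This slot-filling step can be sketched for the volume-adjustment example as follows. The regular-expression patterns and slot names are assumptions for illustration; the patent computes matching degrees with a model rather than regexes:

```python
import re

# Hypothetical preset information for the "volume adjustment" intention.
VOLUME_SLOTS = {
    "volume up": re.compile(r"\b(up|increase|louder)\b"),
    "volume down": re.compile(r"\b(down|decrease|quieter)\b"),
    "mute": re.compile(r"\bmute\b"),
    "adjust to specified volume": re.compile(r"\b(\d{1,3})\b"),
}

def extract_key_info(text, slots=VOLUME_SLOTS):
    """Return (matched preset information, key information) for the first
    slot pattern that matches part of the text, else (None, None)."""
    for slot, pattern in slots.items():
        m = pattern.search(text)
        if m:
            return slot, m.group(1) if m.groups() else m.group(0)
    return None, None
```

For "adjust volume to 60" this yields ("adjust to specified volume", "60"), matching the worked example in the text.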
Step 104: and determining a control instruction corresponding to the voice signal according to the key information and the target intention.
After the intention and the key information are obtained, a corresponding control instruction is determined. For example, in the present embodiment, the intention is volume adjustment and the key information is the volume value 60, so the corresponding control instruction is to adjust the volume to 60.
As shown in fig. 2 and 3, after the control instruction is determined, the electronic device may be further controlled to execute the control instruction to adjust the volume to 60.
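Combining the target intention and key information into a control instruction, as in step 104, can be sketched as follows. The dictionary instruction format is an assumption for illustration; an actual device would define its own command set:

```python
def build_control_instruction(intent, key_info):
    """Map (target intention, key information) to a device control
    instruction. Only the volume example from the text is filled in."""
    if intent == "volume adjustment" and key_info.isdigit():
        return {"action": "set volume", "value": int(key_info)}
    raise ValueError(f"no instruction defined for ({intent!r}, {key_info!r})")
```

The device would then execute the returned instruction, e.g. setting its output volume to 60.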
Therefore, the target intention of the voice signal is obtained, the key information corresponding to the target intention is obtained, the control instruction of the voice signal is determined, and the voice signal can be recognized without depending on a background server, so that offline equipment which is not networked can also recognize the voice signal, and the application range of the voice recognition is widened.
In addition, the technical solution of this embodiment can be implemented without a network; compared with online voice recognition based on a background server, the response speed is faster, the cost is lower, and use is more convenient.
The embodiment of the disclosure provides an off-line voice recognition device.
As shown in fig. 4, in one embodiment, the offline speech recognition apparatus 400 includes:
an obtaining and converting module 401, configured to obtain a voice signal and convert the voice signal into text data;
an intent recognition module 402 for recognizing a target intent of the text data;
a key information extraction module 403, configured to extract key information associated with the target intent from the text data;
a control instruction determining module 404, configured to determine a control instruction corresponding to the voice signal according to the key information and the target intention.
In some of these embodiments, the intent recognition module 402 includes:
the vector conversion submodule is used for converting the text data into a digital vector through a pre-trained conversion model;
the semantic information identification submodule is used for identifying semantic information corresponding to the digital vector;
the intention matching submodule is used for determining the matching degree between the semantic information and a plurality of preset intentions;
and the intention determining submodule is used for taking the preset intention with the highest matching degree with the semantic information as the target intention corresponding to the text data.
In some of these embodiments, the preset intent includes at least one of a network connection control, a power off control, a volume adjustment, a brightness adjustment, and a signal source adjustment.
In some embodiments, the key information extraction module 403 comprises:
the preset information determining submodule is used for determining the preset information which is correspondingly matched with the target intention in the plurality of preset information according to the target intention;
the marking submodule is used for marking a plurality of vocabularies included in the text data and determining the matching degree of each vocabulary and each preset information;
the target vocabulary determining submodule is used for taking the vocabulary with the highest matching degree with the preset information as the target vocabulary containing the key information;
and the key information acquisition submodule is used for acquiring information included in the target vocabulary as the key information.
In some embodiments, the obtaining conversion module 401 includes:
the acquisition submodule is used for acquiring an input voice signal;
the noise reduction submodule is used for carrying out noise reduction processing on the voice signal to obtain a first signal;
the text conversion sub-module is used for converting the first signal into a first text through a pre-trained text conversion model;
and the correction submodule is used for correcting abnormal data existing in the first text to obtain text data corresponding to the voice signal.
The offline speech recognition apparatus in this embodiment can implement the steps of the above-mentioned offline speech recognition method embodiment, and can implement substantially the same or similar technical effects, which are not described herein again.
An embodiment of the present disclosure further provides a mobile terminal, including a processor, a memory, and a computer program stored in the memory and capable of running on the processor, where the computer program, when executed by the processor, implements each process of the above-mentioned embodiment of the offline speech recognition method, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The embodiments of the present disclosure further provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the above-mentioned offline speech recognition method embodiment, and can achieve the same technical effect, and in order to avoid repetition, the detailed description is omitted here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical division, and other divisions are possible in practice; a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices, or units, and may be electrical, mechanical, or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present disclosure.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
If implemented in the form of software functional units and sold or used as a stand-alone product, the functions may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.
In embodiments of the present disclosure, a voice signal is acquired and converted into text data; a target intention of the text data is identified; key information associated with the target intention is extracted from the text data; and a control instruction corresponding to the voice signal is determined according to the key information and the target intention. Because the control instruction is determined locally from the target intention and the key information, the voice signal can be recognized without relying on a background server, so that offline devices that are not networked can also recognize voice signals, widening the application range of voice recognition.
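For illustration only, the overall flow described above can be sketched as a small rule-based pipeline. Nothing below appears in the patent: the intent names, keyword lists, and function names are all hypothetical stand-ins for the trained models the disclosure actually describes.

```python
# Hedged sketch of the offline pipeline: text data -> target intention ->
# key information -> control instruction, with no background server.
# All preset intents and preset information here are illustrative.

PRESET_INTENTS = {
    "volume adjustment": ["volume", "louder", "quieter"],
    "brightness adjustment": ["brightness", "brighter", "dimmer"],
}

PRESET_INFO = {
    "volume adjustment": ["up", "down", "mute"],
    "brightness adjustment": ["up", "down"],
}

def identify_intent(text):
    """Return the preset intention whose keyword list best matches the text."""
    words = text.lower().split()
    scores = {intent: sum(w in kws for w in words)
              for intent, kws in PRESET_INTENTS.items()}
    return max(scores, key=scores.get)

def extract_key_info(text, intent):
    """Return the first word that matches a piece of preset information
    associated with the given intention."""
    for word in text.lower().split():
        if word in PRESET_INFO[intent]:
            return word
    return None

def to_control_instruction(text):
    """Combine target intention and key information into a control instruction."""
    intent = identify_intent(text)
    key = extract_key_info(text, intent)
    return {"intent": intent, "key_info": key}

print(to_control_instruction("turn the volume up"))
# -> {'intent': 'volume adjustment', 'key_info': 'up'}
```

Because everything above is a local table lookup, the device needs no network connection, which is the point of the disclosed method.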
The above describes only specific embodiments of the present disclosure, but the scope of the present disclosure is not limited thereto; any changes or substitutions that a person skilled in the art could readily conceive within the technical scope of the present disclosure shall be covered by the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (12)

  1. An offline speech recognition method, comprising the steps of:
    acquiring a voice signal, and converting the voice signal into text data;
    identifying a target intent of the text data;
    extracting key information associated with the target intention from the text data, wherein the key information matches one of a plurality of pieces of preset information;
    and determining a control instruction corresponding to the voice signal according to the key information and the target intention.
  2. The method of claim 1, wherein the identifying a target intent of the text data comprises:
    converting the text data into a digital vector through a pre-trained conversion model;
    identifying semantic information corresponding to the digital vector;
    determining a degree of matching between the semantic information and a plurality of preset intentions;
    and taking the preset intention with the highest matching degree with the semantic information as a target intention corresponding to the text data.
  3. The method of claim 2, wherein the preset intents include at least one of network connection control, power-off control, volume adjustment, brightness adjustment, and signal source adjustment.
  4. The method of claim 2 or 3, wherein the extracting key information associated with the target intention from the text data comprises:
    determining, from the plurality of pieces of preset information, the preset information matched with the target intention;
    tagging a plurality of words included in the text data, and determining a matching degree between each word and each piece of the preset information;
    taking the word with the highest matching degree with the preset information as a target word containing the key information;
    and acquiring information included in the target word as the key information.
  5. The method of claim 1, wherein the acquiring a voice signal and converting the voice signal into text data comprises:
    acquiring an input voice signal;
    carrying out noise reduction processing on the voice signal to obtain a first signal;
    converting the first signal into a first text through a pre-trained text conversion model;
    and correcting abnormal data existing in the first text to obtain text data corresponding to the voice signal.
  6. An offline speech recognition apparatus comprising:
    the acquisition and conversion module is used for acquiring a voice signal and converting the voice signal into text data;
    an intention recognition module for recognizing a target intention of the text data;
    the key information extraction module is used for extracting key information associated with the target intention from the text data, wherein the key information matches one of a plurality of pieces of preset information;
    and the control instruction determining module is used for determining a control instruction corresponding to the voice signal according to the key information and the target intention.
  7. The apparatus of claim 6, wherein the intent recognition module comprises:
    the vector conversion submodule is used for converting the text data into a digital vector through a pre-trained conversion model;
    the semantic information identifying submodule is used for identifying semantic information corresponding to the digital vector;
    the intention matching submodule is used for determining the matching degree between the semantic information and a plurality of preset intentions;
    and the intention determining submodule is used for taking the preset intention with the highest matching degree with the semantic information as the target intention corresponding to the text data.
  8. The apparatus of claim 7, wherein the preset intents include at least one of network connection control, power-off control, volume adjustment, brightness adjustment, and signal source adjustment.
  9. The apparatus of claim 7 or 8, wherein the key information extraction module comprises:
    the preset information determining submodule is used for determining, from the plurality of pieces of preset information, the preset information matched with the target intention;
    the tagging submodule is used for tagging a plurality of words included in the text data and determining a matching degree between each word and each piece of the preset information;
    the target word determining submodule is used for taking the word with the highest matching degree with the preset information as the target word containing the key information;
    and the key information acquiring submodule is used for acquiring information included in the target word as the key information.
  10. The apparatus of claim 6, wherein the acquisition and conversion module comprises:
    the acquisition submodule is used for acquiring an input voice signal;
    the noise reduction submodule is used for carrying out noise reduction processing on the voice signal to obtain a first signal;
    the text conversion sub-module is used for converting the first signal into a first text through a pre-trained text conversion model;
    and the correction submodule is used for correcting abnormal data existing in the first text to obtain text data corresponding to the voice signal.
  11. An electronic device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the offline speech recognition method according to any of claims 1 to 5.
  12. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the offline speech recognition method of any one of claims 1 to 5.
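The matching steps recited in claims 2 and 4 can be sketched in miniature as follows. This is a hedged illustration, not the claimed implementation: the bag-of-words vector is a deliberately simplified stand-in for the pre-trained conversion model, and the vocabulary and preset intent vectors are invented for the example.

```python
import math

# Sketch of claims 2 and 4: convert text data into a digital vector, match it
# against preset-intent vectors by cosine similarity, and take the preset
# intention with the highest matching degree. VOCAB and the intent phrases
# are illustrative assumptions only.

VOCAB = ["volume", "up", "down", "brightness", "source"]

def to_vector(text):
    """Bag-of-words stand-in for the pre-trained conversion model."""
    words = text.lower().split()
    return [words.count(tok) for tok in VOCAB]

def cosine(a, b):
    """Cosine similarity used here as the 'matching degree'."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

PRESET_INTENT_VECTORS = {
    "volume adjustment": to_vector("volume up down"),
    "brightness adjustment": to_vector("brightness up down"),
}

def identify_target_intent(text):
    """Return the preset intention with the highest matching degree."""
    vec = to_vector(text)
    return max(PRESET_INTENT_VECTORS,
               key=lambda intent: cosine(vec, PRESET_INTENT_VECTORS[intent]))

print(identify_target_intent("turn the volume up"))
# -> volume adjustment
```

In the actual disclosure the digital vector comes from a trained conversion model rather than word counts, but the argmax-over-matching-degree structure is the same.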
CN202080003684.4A 2020-12-25 2020-12-25 Offline voice recognition method and device, electronic equipment and readable storage medium Pending CN115104151A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/139507 WO2022134025A1 (en) 2020-12-25 2020-12-25 Offline speech recognition method and apparatus, electronic device and readable storage medium

Publications (1)

Publication Number Publication Date
CN115104151A true CN115104151A (en) 2022-09-23

Family

ID=82157161

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080003684.4A Pending CN115104151A (en) 2020-12-25 2020-12-25 Offline voice recognition method and device, electronic equipment and readable storage medium

Country Status (2)

Country Link
CN (1) CN115104151A (en)
WO (1) WO2022134025A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116935846B (en) * 2023-06-29 2024-03-19 珠海谷田科技有限公司 Offline conference light control method, device, equipment and storage medium
CN116708905A (en) * 2023-08-07 2023-09-05 海马云(天津)信息技术有限公司 Method and device for realizing digital human interaction on television box

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810998B * 2013-12-05 2016-07-06 中国农业大学 Offline speech recognition method based on a mobile terminal device and implementation method thereof
US10015181B2 (en) * 2016-05-19 2018-07-03 International Business Machines Corporation Using natural language processing for detection of intended or unexpected application behavior
CN106448664A (en) * 2016-10-28 2017-02-22 魏朝正 System and method for controlling intelligent home equipment by voice
CN108831458A * 2018-05-29 2018-11-16 广东声将军科技有限公司 Offline voice-to-command conversion method and system
CN109410927B (en) * 2018-11-29 2020-04-03 北京蓦然认知科技有限公司 Voice recognition method, device and system combining offline command word and cloud analysis
CN111081218A (en) * 2019-12-24 2020-04-28 北京工业大学 Voice recognition method and voice control system

Also Published As

Publication number Publication date
WO2022134025A1 (en) 2022-06-30

Similar Documents

Publication Publication Date Title
US10235994B2 (en) Modular deep learning model
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
CN107301865B (en) Method and device for determining interactive text in voice input
KR102386854B1 (en) Apparatus and method for speech recognition based on unified model
KR20170034227A (en) Apparatus and method for speech recognition, apparatus and method for learning transformation parameter
CN111445898B (en) Language identification method and device, electronic equipment and storage medium
CN102280106A (en) VWS method and apparatus used for mobile communication terminal
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
US10395109B2 (en) Recognition apparatus, recognition method, and computer program product
US20180277145A1 (en) Information processing apparatus for executing emotion recognition
CN115104151A (en) Offline voice recognition method and device, electronic equipment and readable storage medium
KR20190024148A (en) Apparatus and method for speech recognition
KR102192678B1 (en) Apparatus and method for normalizing input data of acoustic model, speech recognition apparatus
KR102345625B1 (en) Caption generation method and apparatus for performing the same
US11417313B2 (en) Speech synthesizer using artificial intelligence, method of operating speech synthesizer and computer-readable recording medium
CN111126084A (en) Data processing method and device, electronic equipment and storage medium
CN110910898B (en) Voice information processing method and device
CN111192586A (en) Voice recognition method and device, electronic equipment and storage medium
CN107910005B (en) Target service positioning method and device for interactive text
CN109754791A (en) Acoustic-controlled method and system
CN111128127A (en) Voice recognition processing method and device
CN116127966A (en) Text processing method, language model training method and electronic equipment
CN115132170A (en) Language classification method and device and computer readable storage medium
CN113851113A (en) Model training method and device and voice awakening method and device
KR102631143B1 (en) Voice synthesizer using artificial intelligence, operating method of voice synthesizer and computer redable recording medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination