CN111326160A - Speech recognition method, system and storage medium for correcting noise text - Google Patents

Speech recognition method, system and storage medium for correcting noise text

Info

Publication number: CN111326160A
Application number: CN202010167523.0A
Authority: CN (China)
Legal status: Pending
Prior art keywords: text, recombined, recognition, sentence, audio signal
Other languages: Chinese (zh)
Inventors: 陆俊贤, 黄华, 周院平, 孙信中, 矫人全
Current assignee: Nanjing Aoto Electronics Co., Ltd.
Original assignee: Nanjing Aoto Electronics Co., Ltd.
Application filed by Nanjing Aoto Electronics Co., Ltd.
Priority to: CN202010167523.0A
Publication of: CN111326160A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning

Landscapes

  • Engineering & Computer Science
  • Computational Linguistics
  • Health & Medical Sciences
  • Audiology, Speech & Language Pathology
  • Human Computer Interaction
  • Physics & Mathematics
  • Acoustics & Sound
  • Multimedia
  • Artificial Intelligence
  • Machine Translation

Abstract

The invention relates to a speech recognition method, system and storage medium for correcting noisy text. The speech recognition method comprises the following steps: acquiring an audio signal; performing speech recognition on the acquired audio signal to obtain an initial recognition text; performing a word segmentation operation on the initial recognition text to obtain decomposed phrases; recombining the decomposed phrases to obtain a plurality of recombined sentences; calculating a probability value for each recombined sentence using an N-Gram model; calculating a weight value for each recombined sentence using a TF-IDF model against a pre-constructed business dialogue corpus; and calculating a weighted probability value for each recombined sentence from its probability value and weight value, and selecting the recombined sentence whose weighted probability value meets a preset condition as the result recognition text. Noise text mixed in from other speakers' voices can thus be filtered out to obtain a speech recognition result that fits the current conversation scene, improving the accuracy of speech recognition as well as the efficiency and experience of interaction.

Description

Speech recognition method, system and storage medium for correcting noise text
Technical Field
The present invention relates to the field of speech recognition technology, and in particular, to a speech recognition method, system and storage medium for correcting noisy text.
Background
With the continuous development of artificial intelligence technology, it is used in more and more scenes to interact with users and provide various services. Speech recognition fits people's normal communication habits and plays an important role in human-computer interaction.
Scenes such as bank halls or business halls are relatively noisy environments. While an intelligent robot converses with a user, it picks up not only the user's voice but also a large amount of environmental noise: other people's conversations, machine sounds, or street noise from outside. Because of this interference, the speech recognition result may contain noise text, such as other people's words or meaningless noise mixed in, so the intelligent robot cannot effectively judge the user's actual intention, which affects the efficiency and experience of the interaction between the intelligent robot and the user.
In the prior art, to reduce the interference of environmental noise with speech recognition, the picked-up audio signal is filtered before features are extracted for recognition, for example by setting a volume threshold and discarding audio whose volume is below it. However, such filtering before feature extraction deletes parts of the audio signal and affects the integrity of the subsequent recognition result; moreover, it cannot effectively filter out noise signals that resemble human speech, and therefore cannot effectively reduce the interference of environmental noise with speech recognition.
Disclosure of Invention
Therefore, in view of the problem that environmental noise in existing scenes prevents an intelligent robot from accurately judging the user's intention and degrades the efficiency and experience of the interaction, it is necessary to provide a speech recognition method, system and storage medium for correcting noisy text.
An embodiment of the present application provides a speech recognition method for correcting noisy text, including:
acquiring an audio signal;
carrying out voice recognition on the obtained audio signal to obtain an initial recognition text;
performing word segmentation operation on the initial recognition text to obtain a decomposed word group;
recombining the decomposed phrases to obtain a plurality of recombined sentences;
calculating the probability value of each recombined sentence by using an N-Gram model;
calculating the weight value of each recombined sentence by using a TF-IDF model according to a pre-constructed service dialogue corpus;
and calculating a weighted probability value for each recombined sentence from its probability value and weight value, and selecting the recombined sentence whose weighted probability value meets a preset condition as the result recognition text.
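The claimed steps can be sketched end to end as follows. This is a minimal sketch under stated assumptions: every component function is a caller-supplied placeholder with a hypothetical name, and multiplicative weighting is one plausible reading of "weighted probability value", which the claim does not fix.

```python
# Minimal sketch of the claimed pipeline; component functions are
# caller-supplied placeholders, not the patent's implementation.
def recognize(audio, asr, segment, recombine, ngram_prob, tfidf_weight):
    text = asr(audio)                  # speech recognition -> initial text
    phrases = segment(text)            # word segmentation
    candidates = recombine(phrases)    # recombined sentences
    # weighted probability = N-Gram probability x TF-IDF weight (assumption)
    scored = [(s, ngram_prob(s) * tfidf_weight(s)) for s in candidates]
    return max(scored, key=lambda x: x[1])[0]

# Toy demonstration with stub components.
result = recognize(
    audio=None,
    asr=lambda a: "我想取钱",
    segment=lambda t: ["我", "想", "取钱"],
    recombine=lambda p: ["我想取钱", "取钱想我"],
    ngram_prob=lambda s: {"我想取钱": 0.5, "取钱想我": 0.01}[s],
    tfidf_weight=lambda s: 1.0,
)
print(result)  # → 我想取钱
```

The stubs exist only to show the data flow; each step is detailed in the embodiments below.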
In some embodiments, the step of recombining the decomposed phrases to obtain a plurality of recombined sentences specifically includes:
performing part-of-speech tagging on the decomposed phrases;
and recombining to obtain a plurality of recombined sentences according to the part of speech of the decomposed phrases.
In some embodiments, after the step of acquiring the audio signal, the method further comprises:
and judging whether a person exists, and if so, performing voice recognition on the acquired audio signal to obtain an initial recognition text.
In some embodiments, when acquiring audio signals, the sound source bearing may also be acquired;
the step of judging whether a person exists specifically comprises the following steps:
and judging whether the sound source direction is occupied, and performing voice recognition on the acquired audio signal only when the sound source direction is occupied to obtain an initial recognition text.
In some embodiments, the semantics of the result recognition text are recognized using a pre-trained semantic recognition model to obtain a semantic result.
Another embodiment of the present application provides a speech recognition system for correcting a noisy text, including:
an audio acquisition unit for acquiring an audio signal;
the voice recognition unit is used for carrying out voice recognition on the acquired audio signal to obtain an initial recognition text;
the word segmentation unit is used for carrying out word segmentation operation on the initial recognition text to obtain a decomposed word group;
the sentence recombination unit is used for recombining the decomposed phrases to obtain a plurality of recombined sentences;
the sentence probability calculation unit is used for calculating the probability value of each recombined sentence by utilizing the N-Gram model;
the weight determining unit is used for calculating the weight value of each recombined sentence by using a TF-IDF model according to a pre-constructed service dialogue corpus;
and a result text determining unit for calculating a weighted probability value for each recombined sentence from its probability value and weight value, and selecting the recombined sentence whose weighted probability value meets a preset condition as the result recognition text.
In some embodiments, the system further comprises a person detection unit for judging whether a person is present; if a person is detected, the speech recognition unit is triggered to perform speech recognition on the acquired audio signal.
In some embodiments, the system further comprises a stop word filtering unit for performing a stop word removal operation on the decomposed phrases.
In some embodiments, the system further comprises a semantic recognition unit for recognizing the semantics of the result recognition text using a pre-trained semantic recognition model to obtain a semantic result.
Another embodiment of the present application further provides a machine-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the speech recognition method for correcting noisy text according to any of the previous embodiments.
On the basis of performing speech recognition on the audio signal, the speech recognition method provided by the embodiments of the present application segments and recombines the initial recognition text, and uses an N-Gram model and a TF-IDF model together with a business dialogue corpus to analyse at the semantic level whether the recombined sentences fit the business logic, so as to select the recombined sentences that meet the requirements as the final recognition text. The method can filter out noise text mixed in from other speakers' voices to obtain a speech recognition result that fits the current conversation scene, improving the accuracy of speech recognition and the efficiency and experience of interaction between the intelligent robot and the user.
Drawings
FIG. 1 is a flowchart illustrating a speech recognition method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a speech recognition method according to another embodiment of the present application;
FIG. 3 is a block diagram of a speech recognition system according to an embodiment of the present application;
fig. 4 is a schematic architecture diagram of a speech recognition system according to another embodiment of the present application.
Detailed Description
So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments thereof which are illustrated in the appended drawings. In addition, the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
As shown in fig. 1, an embodiment of the present application discloses a speech recognition method for correcting a noisy text, including:
s100, acquiring an audio signal;
a speech recognition method for correcting noisy text may be performed by a speech recognition system and provides a front-end intelligent robot to provide an interface for user interaction. The intelligent robot is provided with a radio device which can collect audio signals and then send the audio signals to the voice recognition system. The intelligent robot can acquire audio signals in real time and can also acquire the audio signals according to triggering.
The scene in which the intelligent robot interacts with the user generally has a large amount of environmental noise. In order to avoid acquiring a large number of invalid audio signals, the intelligent robot may acquire an audio signal in a preset direction. Specifically, a directional radio device, such as a microphone array, may be used to obtain an audio signal in a preset direction, and the spatial filtering characteristic is used to enhance the audio signal in the preset direction and suppress noise in other directions.
For example, the directional sound pickup device may be a six-microphone array, which can acquire the sound source azimuth while acquiring the audio signal. The intelligent robot may be configured with an interaction direction, such as a certain angle range directly in front of it. The preset direction for audio acquisition may be the same as the interaction direction.
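Checking that a sound source azimuth falls inside the interaction direction can be sketched as an angular window test. The 0° centre and the ±30° half-width below are illustrative assumptions, not values given in the description:

```python
# Hypothetical azimuth gate: accept audio only when the sound source
# azimuth lies within the configured interaction-direction window.
def in_interaction_zone(azimuth_deg, center_deg=0.0, half_width_deg=30.0):
    # Normalise the angular difference into [-180, 180) before comparing,
    # so azimuths that wrap around 0/360 degrees are handled correctly.
    diff = (azimuth_deg - center_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= half_width_deg

print(in_interaction_zone(15))   # → True
print(in_interaction_zone(90))   # → False
```

A microphone array's direction-of-arrival estimate would feed `azimuth_deg`; the window parameters come from how the robot's interaction direction is configured.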
Because of the presence of ambient noise, the acquired audio signal may only include ambient noise, with virtually no user interaction with the intelligent robot. In order to avoid the misjudgment caused by the environmental noise, in some embodiments, as shown in fig. 2, after step S100, the method may further include:
s150, judging whether a person exists, and if so, entering the step S200; if no person is judged, the flow is terminated, and the voice recognition is not carried out on the acquired audio signal.
The detection modes such as laser radar, infrared detection, human face detection, pressure sensing and the like can be used for judging whether a person exists. If no people are detected in the surroundings of the intelligent robot, it can be considered that the acquired audio signal does not contain the voice of the user. On the contrary, if a person is detected in the surroundings of the smart robot, the voice of the user may be included in the audio signal.
In some embodiments, the intelligent robot uses a microphone array to acquire both the audio signal and the sound source azimuth. In step S150, it is judged whether a person is present in the sound source direction; if so, step S200 can be entered; if not, the flow is terminated without performing speech recognition on the acquired audio signal, even if people are present in other directions.
S200, carrying out voice recognition on the acquired audio signal to obtain an initial recognition text;
s300, performing word segmentation operation on the initial recognition text to obtain a decomposed word group;
Common speech recognition techniques can be used to perform speech recognition on the audio signal and obtain the initial recognition text, for example the Dynamic Time Warping (DTW) algorithm, speech recognition based on a deep neural network-hidden Markov model (DNN-HMM), or speech recognition based on a Gaussian mixture model-hidden Markov model (GMM-HMM). Existing speech recognition toolkits/products such as Kaldi, HTK or Julius may also be used.
The word segmentation operation is to normalize a sentence to form a phrase sequence for subsequent processing. In this embodiment, a common word segmentation algorithm may be used when performing word segmentation operation. Common word segmentation algorithms may include dictionary-based word segmentation methods, statistical-based word segmentation methods, rule-based word segmentation methods, word segmentation methods based on word labeling, and the like. In some examples, a dictionary-based word segmentation method, such as a jieba word segmentation tool, is used to perform a word segmentation operation on the initial recognition text to obtain a decomposed word group. In order to improve the accuracy of word segmentation, an industry dictionary can be combined at the same time.
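A dictionary-based segmenter of the kind mentioned above can be sketched as forward maximum matching. This is an illustrative sketch: a production system would use a tool such as jieba with an industry dictionary merged in, and the vocabulary here is a made-up toy.

```python
# Forward maximum matching: at each position, greedily take the longest
# dictionary word; fall back to a single character when nothing matches.
def fmm_segment(text, dictionary, max_len=4):
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])  # single-char fallback when j == i+1
                i = j
                break
    return words

vocab = {"我", "想", "取钱", "你好", "放", "家里"}  # toy dictionary
print(fmm_segment("我想取钱", vocab))  # → ['我', '想', '取钱']
```

Merging an industry dictionary, as the description suggests, amounts to adding domain terms to `vocab` so that business phrases such as 取钱 are kept whole.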
Human language contains a large number of function words, such as modal particles, that usually carry no definite meaning. To make the subsequent processing more targeted and reduce the interference of irrelevant words, in some embodiments, as shown in fig. 2, after step S300, the method may further include:
and S350, performing word-stop-removing operation on the decomposed word group.
Removing stop words means removing the stop words from the decomposed phrases. Stop words are words that contribute little to the true semantics of a sentence, typically interjections, modal particles and the like. The operation can be carried out using a pre-constructed stop word dictionary: each phrase among the decomposed phrases is looked up in the stop word dictionary, and if it is found there, it is removed.
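The dictionary lookup described above amounts to a set-membership filter. The stop-word list below is illustrative only, not the patent's dictionary:

```python
# Stop-word removal against a pre-built stop-word dictionary (step S350).
STOP_WORDS = {"的", "了", "啊", "呢", "吧", "嗯"}  # illustrative list

def remove_stop_words(tokens):
    # Keep only tokens that are not in the stop-word dictionary.
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["嗯", "我", "想", "取钱", "了"]))  # → ['我', '想', '取钱']
```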
S400, recombining the decomposed phrases to obtain a plurality of recombined sentences;
the number of phrases in the recombined sentences can be set to be a plurality of different values so as to simulate the actual possible expression as much as possible and avoid missing the real dialogue sentences of the user.
During the reorganization, the decomposed phrases may be arranged and combined according to the quantity value of the phrases in the reorganized sentence, and the result of the arrangement and combination is the plurality of reorganized sentences obtained in S400.
In some embodiments, part-of-speech tagging may be further performed on the decomposed phrase to obtain a part-of-speech of the decomposed phrase. Part of speech refers to the grammatical properties of a phrase, such as nouns, verbs, adjectives, pronouns, numerics, quantifications, adverbs, prepositions, conjunctions, helpers, and the like. Common part-of-speech tagging methods can be used, such as a tagging method based on a statistical model, a tagging algorithm based on rules, an algorithm based on a combination of statistics and rules, a tagging algorithm based on a finite state machine, and a tagging algorithm based on a neural network; existing part-of-speech tagging tools, such as jieba, HanLP, NLTK, etc. toolkits may also be used.
When the sentences are recombined, the decomposed phrases can be recombined according to the parts of speech of the decomposed phrases, so that the recombined sentences which do not accord with the expression habit are reduced, the data processing amount is reduced, and the processing efficiency is improved. For example, according to the syntactic structure, a verb is taken as a core, phrases corresponding to parts of speech are selected at front and rear positions of verb phrases in the initial recognition text, and a plurality of recombined sentences are obtained by permutation and combination. And respectively carrying out the recombination process on all the decomposed verb phrases to obtain a plurality of recombined sentences required by S400. It is understood that other word phrases can be used as the core when the sentence is recombined.
The following uses the initial recognition text "good I want hello get money put home" (a noisy transcription in which another speaker's words are mixed in) to illustrate the recombined sentences. Assume the initial recognition text is segmented into the decomposed phrases "good, I, want, hello, get money, put, home", which contain three verbs: "want", "get money" and "put". If sentences are then recombined with the verbs as cores, at least the following recombined sentences can be obtained: "I want to get money", "I want good", "I want to put it at home", "I want to get money and put it at home", "get money", "I get money", "get money and put it at home", "hello get money", "I put it at home", "hello put it at home".
Furthermore, in order to improve the efficiency of sentence recombination and make the sentence recombination more fit to the actual business situation, a preset syntax structure can be combined when the sentence is recombined. Illustratively, common business dialogs can be combed out according to a business dialog corpus, and a preset syntactic structure is constructed. For example, for a withdrawal service, the possible dialog is "i want to get XXX dollars", and the preset syntactic structure may be set to "person pronouns + verbs + number words + quantifier". According to the selected verb phrase, matching with a preset syntactic structure can be carried out; and then, selecting phrases corresponding to the part of speech from the front and rear positions of the verb phrases in the initial recognition text according to the matched preset syntactic structure to obtain the recombined sentence.
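The raw enumeration step, before any part-of-speech or syntax-template filtering, can be sketched with `itertools.permutations`. Real use would prune these candidates with the verb-centred or template-matching filters described above; this sketch shows only the unfiltered candidate set:

```python
# Enumerate ordered selections of the decomposed phrases at several
# sentence lengths; this is the unfiltered candidate set of S400.
from itertools import permutations

def recombine(phrases, lengths=(2, 3)):
    candidates = []
    for n in lengths:
        for combo in permutations(phrases, n):
            candidates.append("".join(combo))
    return candidates

print(recombine(["我", "想", "取钱"], lengths=(2,)))
# → ['我想', '我取钱', '想我', '想取钱', '取钱我', '取钱想']
```

Because the candidate count grows factorially with the phrase count, the part-of-speech and preset-syntax filters in the description are what keep the set tractable in practice.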
S500, calculating probability values of the recombination sentences by using an N-Gram model;
the N-Gram model is a model in natural speech processing, inputs a recombined sentence and outputs a probability of the recombined sentence, in particular to a joint probability of a phrase in the recombined sentence.
The N-Gram model may be one of a Uni-Gram model (N = 1), a Bi-Gram model (N = 2), and a Tri-Gram model (N = 3). The following describes the process of acquiring the probability value of the recombined sentence by taking an N =3 Tri-Gram model as an example.
Assume a recombined sentence S = (w1, w2, ⋯, wn). In the Tri-Gram model, a training sentence corpus may be constructed in advance, and the probability value output for the recombined sentence S is:
P(S) = p(w1 w2 ⋯ wn)
     = p(w1 | begin1, begin2) · p(w2 | w1, begin1) · p(w3 | w2, w1) · ⋯ · p(wn | wn-1, wn-2),
where p(w1 | begin1, begin2) is the number of sentences in the corpus beginning with w1 divided by the total number of sentences; p(w3 | w2, w1) is the number of times the sequence w1 w2 w3 occurs divided by the number of times w1 w2 occurs; and p(wn | wn-1, wn-2) is the number of times the sequence wn-2 wn-1 wn occurs divided by the number of times wn-2 wn-1 occurs.
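The counting scheme above can be sketched directly. The `<b1>`/`<b2>` padding tokens stand in for begin1/begin2, the two-sentence corpus is a toy, and a real system would add smoothing so that unseen trigrams do not zero out the whole product:

```python
# Minimal Tri-Gram scorer following the P(S) factorisation above.
from collections import Counter

def train_trigram(sentences):
    tri, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<b2>", "<b1>"] + s  # explicit begin1/begin2 padding
        for i in range(2, len(toks)):
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
            bi[(toks[i - 2], toks[i - 1])] += 1
    return tri, bi

def score(sentence, tri, bi):
    p = 1.0
    toks = ["<b2>", "<b1>"] + sentence
    for i in range(2, len(toks)):
        ctx = (toks[i - 2], toks[i - 1])
        # count(w_{i-2} w_{i-1} w_i) / count(w_{i-2} w_{i-1}); 0 if unseen
        p *= tri[(toks[i - 2], toks[i - 1], toks[i])] / bi[ctx] if bi[ctx] else 0.0
    return p

corpus = [["我", "想", "取钱"], ["我", "想", "转账"]]
tri, bi = train_trigram(corpus)
print(score(["我", "想", "取钱"], tri, bi))  # → 0.5
```

The example score is 0.5 because the final trigram (我, 想, 取钱) appears in one of the two sentences sharing the context (我, 想), while the earlier factors are 1.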
It can be understood that when the probability value of the recombined sentence is calculated by using the N-Gram model, a general training sentence library can be used; a pre-constructed corpus of business dialogs may also be utilized.
S600, calculating the weight value of each recombined sentence by using a TF-IDF model according to a pre-constructed service dialogue corpus;
the TF-IDF model is called Term Frequency-Inverse Document Frequency, is a statistical method and is used for evaluating the importance degree of a target phrase in a Document. The principle is that the importance of a phrase increases in proportion to the number of times it appears in a document, but at the same time decreases in inverse proportion to the frequency with which it appears in the corpus.
A business dialogue corpus can be constructed in advance, and a TF-IDF weight value of the recombined sentence can be calculated by utilizing a TF-IDF model so as to represent the fit degree of the recombined sentence and the business.
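One way to realise this weighting is sketched below, under the assumption (not fixed by the description) that a sentence's weight is the mean TF-IDF importance of its phrases over the business dialogue corpus, so that sentences built from business vocabulary score higher than sentences of out-of-corpus words:

```python
# Sentence weighting against a business-dialogue corpus (step S600).
# The corpus, the smoothed IDF, and the mean aggregation are assumptions.
import math
from collections import Counter

def term_importance(corpus_docs):
    """TF-IDF importance of each phrase within the business corpus."""
    n_docs = len(corpus_docs)
    tf = Counter(t for doc in corpus_docs for t in doc)
    total = sum(tf.values())
    df = Counter(t for doc in corpus_docs for t in set(doc))
    return {t: (tf[t] / total) * (math.log((n_docs + 1) / (df[t] + 1)) + 1)
            for t in tf}

def sentence_weight(tokens, importance):
    # Phrases absent from the business corpus contribute zero weight.
    return sum(importance.get(t, 0.0) for t in tokens) / len(tokens)

corpus = [["我", "想", "取钱"], ["我", "要", "转账"], ["取钱", "放", "家里"]]
imp = term_importance(corpus)
print(sentence_weight(["我", "想", "取钱"], imp) > sentence_weight(["随便", "说"], imp))  # → True
```

A library implementation such as scikit-learn's TfidfVectorizer could replace `term_importance`; the sketch keeps the arithmetic visible.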
S700, calculating the weighted probability value of each recombined sentence according to the probability value and the weighted value of each recombined sentence, selecting the recombined sentence with the weighted probability value meeting the preset condition as a result identification text.
Since the TF-IDF weight value of a recombined sentence represents how well it fits the business, weighting the probability value of each recombined sentence by its TF-IDF weight value adjusts the probability value to the business scene and yields the weighted probability value of the recombined sentence.
A preset condition can be set in advance to screen the recombined sentences and obtain the result recognition text as the formally output speech recognition result. The preset condition may be that the recombined sentence has the maximum weighted probability value; alternatively, the recombined sentences may be ranked by weighted probability value and the top m taken as result recognition texts.
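Under the "maximum weighted probability value" condition, selection reduces to an argmax over the candidates. Multiplicative combination of the two scores is an assumption; the description leaves the exact weighting formula open:

```python
# Step S700 under the max-weighted-probability preset condition.
def pick_result(candidates):
    """candidates: list of (sentence, ngram_prob, tfidf_weight) tuples."""
    scored = [(s, p * w) for s, p, w in candidates]  # assumed combination
    return max(scored, key=lambda x: x[1])[0]

cands = [("我想取钱", 0.5, 0.9), ("你好取钱", 0.4, 0.3), ("放家里", 0.2, 0.1)]
print(pick_result(cands))  # → 我想取钱
```

The top-m variant mentioned above would sort `scored` descending and return the first m sentences instead of a single argmax.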
On the basis of performing speech recognition on the audio signal, the speech recognition method provided by the embodiments of the present application segments and recombines the initial recognition text, and uses an N-Gram model and a TF-IDF model together with a business dialogue corpus to analyse at the semantic level whether the recombined sentences fit the business logic, so as to select the recombined sentences that meet the requirements as the final recognition text. The method can filter out noise text mixed in from other speakers' voices to obtain a speech recognition result that fits the current conversation scene, improving the accuracy of speech recognition and the efficiency and experience of interaction between the intelligent robot and the user.
In some scenarios, not all of the initially recognized texts include noise texts, and therefore, in some embodiments, as shown in fig. 2, after step S200, the method may further include:
s250, judging the sentence smoothness of the initial identification text, and if the sentence smoothness of the initial identification text is judged to be not smooth, entering the step S300; and if the sentence of the initial recognition text is judged to be smooth, the initial recognition text is taken as a result recognition text.
The fluency of the initial recognition text can be judged with conventional sentence-fluency methods from the field of natural language processing, for example by computing the sentence's probability value with an N-Gram model and judging fluency from that value, or by using a neural network pre-trained for fluency judgment and thresholding its output score.
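An N-Gram-based fluency gate can be sketched as thresholding the length-normalised log probability of the sentence. The threshold value of -2.0 is an illustrative assumption; it would be tuned on real data:

```python
# Fluency check (S250): length-normalised log-probability vs a threshold.
import math

def is_fluent(prob, n_tokens, threshold=-2.0):
    if prob <= 0:
        return False  # contains an unseen N-Gram; treat as not fluent
    # Per-token average log-probability, so long sentences are not penalised.
    return math.log(prob) / n_tokens >= threshold

print(is_fluent(0.5, 3))   # log(0.5)/3 ≈ -0.23 → True
print(is_fluent(1e-9, 3))  # ≈ -6.9 → False
```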
By adding this fluency judgment on the initial recognition texts, it can be determined which initial recognition texts contain noise text and which do not. Only the initial recognition texts containing noise text then need to enter the subsequent steps, which effectively reduces system overhead.
In some embodiments, as shown in fig. 2, the speech recognition method, after step S700, further includes:
and S800, recognizing the semantics of the text by using the pre-trained semantic recognition model to obtain a semantic result.
A semantic recognition model may be trained using a pre-constructed corpus of business dialogs. After the result recognition text is obtained, a semantic result which accords with business logic can be obtained by utilizing a pre-trained semantic recognition model. The system can respond to the actual intention of the user according to the semantic result and by combining with the service logic, for example, if the user needs to collect money, the number collecting operation of the money collecting service can be executed; if the user wants to transfer money, a transfer service flow or interface may be provided.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the embodiments. Further, those skilled in the art will also appreciate that the embodiments described in the specification are presently preferred and that no particular act is required of the embodiments of the application.
As shown in fig. 3, an embodiment of the present application discloses a speech recognition system for correcting a noisy text, comprising:
an audio acquisition unit 100 for acquiring an audio signal;
the voice recognition unit 200 is configured to perform voice recognition on the acquired audio signal to obtain an initial recognition text;
a word segmentation unit 300, configured to perform word segmentation on the initial recognition text to obtain a decomposed word group;
a sentence reorganizing unit 400, configured to reorganize the decomposed phrases to obtain a plurality of reorganized sentences;
a sentence probability calculating unit 500, configured to calculate a probability value of each recombined sentence by using an N-Gram model;
a weight determining unit 600, configured to calculate, according to a pre-constructed service dialogue corpus, a weight value of each recombination statement by using a TF-IDF model;
and a result text determining unit 700, configured to calculate a weighted probability value for each recombined sentence from its probability value and weight value, and to select the recombined sentence whose weighted probability value meets a preset condition as the result recognition text.
The specific working modes of the audio acquiring unit 100, the speech recognizing unit 200, the word segmentation unit 300, the sentence recombining unit 400, the sentence probability calculating unit 500, the weight determining unit 600, and the result text determining unit 700 can be referred to the description in the foregoing method embodiments.
In some embodiments, the audio obtaining unit 100 is configured to obtain an audio signal in a preset direction.
In some embodiments, as shown in fig. 4, the speech recognition system may further include a person detection unit 150 for determining whether a person is present. If a person is detected, the person detection unit 150 triggers the voice recognition unit 200 to perform speech recognition on the acquired audio signal; if no person is detected, the process terminates without performing speech recognition on the acquired audio signal. The person detection unit 150 may be triggered by the audio acquisition unit 100, for example, each time the audio acquisition unit 100 acquires an audio signal. This avoids false triggering caused by environmental noise and reduces system overhead.
Further, the audio acquisition unit 100 may acquire the sound source bearing together with the audio signal. The person detection unit 150 may then be configured to determine whether a person is present in the sound source direction. If a person is present in that direction, it triggers the voice recognition unit 200 to perform speech recognition on the acquired audio signal; if no person is present in the sound source direction, recognition is not performed, even if a person is present in another direction. This makes it possible to judge more accurately whether a user intends to interact: only when the sound source bearing coincides with the bearing of a detected person is the audio treated as user speech, which improves the accuracy of the judgment.
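The bearing comparison described above might look like the following sketch, where recognition is triggered only when the sound-source bearing (e.g. from a microphone array) falls within a tolerance of some detected person's bearing (e.g. from a camera). The 15-degree tolerance and the function names are illustrative assumptions, not part of the patent.

```python
def should_recognize(source_bearing, person_bearings, tolerance=15.0):
    # Trigger speech recognition only if the sound source bearing
    # matches the bearing of at least one detected person. Angles
    # are in degrees; differences are taken the short way around
    # the circle, so 355 degrees and 5 degrees are 10 degrees apart.
    def angular_diff(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)
    return any(angular_diff(source_bearing, p) <= tolerance
               for p in person_bearings)
```

With no detected persons the function returns False, matching the behavior where audio with no person in the sound source direction is discarded.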
In some embodiments, as shown in fig. 4, the speech recognition system may further include a stop word filtering unit 350 for removing stop words from the decomposed phrases. This sharpens the focus of subsequent processing and reduces interference from irrelevant words.
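A minimal sketch of the stop-word filtering step is shown below. The stop-word list here is a placeholder for illustration; a real deployment would load a domain-appropriate list for its language and business scenario.

```python
# Placeholder stop-word list; a production system would load a
# curated list matching its language and business domain.
STOP_WORDS = {"the", "a", "an", "of", "um", "uh"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Keep only tokens that are not stop words, preserving order.
    return [t for t in tokens if t not in stop_words]
```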
In some embodiments, as shown in fig. 4, the speech recognition system may further include a sentence continuity determining unit 250, configured to judge whether the initial recognition text is fluent. If the initial recognition text is judged not fluent, it triggers the word segmentation unit 300 to perform word segmentation on the initial recognition text; if the initial recognition text is judged fluent, the initial recognition text is used directly as the result recognition text. In this way, only initial recognition texts that contain noise text enter the subsequent steps, which effectively reduces system overhead.
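The patent does not specify how sentence continuity is judged. One plausible gate, sketched below purely as an assumption, is a perplexity threshold under the same N-Gram language model used later for scoring: a fluent sentence has low perplexity and bypasses the recombination steps.

```python
import math

def is_fluent(logprob_fn, sentence, max_perplexity=50.0):
    # logprob_fn returns the total log-probability of the token list
    # under some language model. Perplexity is the exponentiated
    # negative average log-probability per token (+1 for the implicit
    # end token). The threshold value is an illustrative assumption.
    logp = logprob_fn(sentence)
    n = len(sentence) + 1
    perplexity = math.exp(-logp / n)
    return perplexity <= max_perplexity
```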
In some embodiments, as shown in fig. 4, the speech recognition system may further include a semantic recognition unit 800, configured to recognize the semantics of the result recognition text using a pre-trained semantic recognition model to obtain a semantic result.
According to the voice recognition scheme provided by the embodiments of the present application, after speech recognition of the audio signal, the initial recognition text is segmented and recombined. Using an N-Gram model and a TF-IDF model together with a business dialogue corpus, each recombined sentence is analyzed at the semantic level for conformity with the business logic, and the recombined sentence that meets the requirements is selected as the final recognition text. This scheme can filter out noise text mixed in from other speakers' voices, obtain a recognition result consistent with the current dialogue scene, improve recognition accuracy, and enhance the efficiency and experience of interaction between an intelligent robot and the user.
An embodiment of the present application provides a machine-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the speech recognition method for correcting noisy text according to any of the embodiments described above.
If the integrated components/modules/units of the system/computer device are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods in the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium; when executed by a processor, the computer program implements the steps of the method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so on. It should be noted that the content included in the computer-readable medium may be appropriately added or removed in accordance with the requirements of legislation and patent practice in the relevant jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.
In the several embodiments provided in the present invention, it should be understood that the disclosed system and method may be implemented in other ways. For example, the system embodiments described above are merely illustrative, and for example, the division of the components is only one logical division, and other divisions may be realized in practice.
In addition, each functional module/component in each embodiment of the present invention may be integrated into the same processing module/component, or each module/component may exist alone physically, or two or more modules/components may be integrated into the same module/component. The integrated modules/components can be implemented in the form of hardware, or can be implemented in the form of hardware plus software functional modules/components.
It will be evident to those skilled in the art that the embodiments of the present invention are not limited to the details of the foregoing illustrative embodiments, and that the embodiments of the present invention are capable of being embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the embodiments being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. Several units, modules or means recited in the system, apparatus or terminal claims may also be implemented by one and the same unit, module or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A speech recognition method for correcting noisy text, comprising:
acquiring an audio signal;
carrying out voice recognition on the obtained audio signal to obtain an initial recognition text;
performing word segmentation operation on the initial recognition text to obtain a decomposed word group;
recombining the decomposed phrases to obtain a plurality of recombined sentences;
calculating the probability value of each recombined sentence by using an N-Gram model;
calculating the weight value of each recombined sentence by using a TF-IDF model according to a pre-constructed service dialogue corpus;
and calculating the weighted probability value of each recombined sentence according to the probability value and the weight value of each recombined sentence, and selecting a recombined sentence whose weighted probability value meets a preset condition as the result recognition text.
2. The speech recognition method according to claim 1, wherein the step of recombining the decomposed phrases to obtain a plurality of recombined sentences specifically comprises:
performing part-of-speech tagging on the decomposed phrases;
and recombining to obtain a plurality of recombined sentences according to the part of speech of the decomposed phrases.
3. The speech recognition method of claim 1, further comprising, after the step of obtaining an audio signal:
and judging whether a person exists, and if so, performing voice recognition on the acquired audio signal to obtain an initial recognition text.
4. A speech recognition method according to claim 3, characterized in that when the audio signal is acquired, the sound source orientation is also acquired;
the step of judging whether a person exists specifically comprises the following steps:
and judging whether a person is present in the sound source direction, and performing voice recognition on the acquired audio signal to obtain an initial recognition text only when a person is present in the sound source direction.
5. The speech recognition method of claim 1, further comprising:
and recognizing the semantics of the result recognition text by using a pre-trained semantic recognition model to obtain a semantic result.
6. A speech recognition system for correcting noisy text, comprising:
an audio acquisition unit for acquiring an audio signal;
the voice recognition unit is used for carrying out voice recognition on the acquired audio signal to obtain an initial recognition text;
the word segmentation unit is used for carrying out word segmentation operation on the initial recognition text to obtain a decomposed word group;
the sentence recombination unit is used for recombining the decomposed phrases to obtain a plurality of recombined sentences;
the sentence probability calculation unit is used for calculating the probability value of each recombined sentence by utilizing the N-Gram model;
the weight determining unit is used for calculating the weight value of each recombined sentence by using a TF-IDF model according to a pre-constructed service dialogue corpus;
and the result text determining unit is used for calculating the weighted probability value of each recombined sentence according to the probability value and the weight value of each recombined sentence, and selecting a recombined sentence whose weighted probability value meets a preset condition as the result recognition text.
7. The voice recognition system according to claim 6, further comprising a person detection unit for determining whether a person is present; and if a person is detected, triggering the voice recognition unit to perform voice recognition on the acquired audio signal.
8. The speech recognition system of claim 6, further comprising a stop word filtering unit configured to perform a stop word operation on the decomposed phrase.
9. The speech recognition system of claim 6, further comprising a semantic recognition unit, configured to recognize the semantics of the result recognition text using a pre-trained semantic recognition model to obtain a semantic result.
10. A machine readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of speech recognition to correct noisy text according to any of claims 1-5.
CN202010167523.0A 2020-03-11 2020-03-11 Speech recognition method, system and storage medium for correcting noise text Pending CN111326160A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010167523.0A CN111326160A (en) 2020-03-11 2020-03-11 Speech recognition method, system and storage medium for correcting noise text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010167523.0A CN111326160A (en) 2020-03-11 2020-03-11 Speech recognition method, system and storage medium for correcting noise text

Publications (1)

Publication Number Publication Date
CN111326160A true CN111326160A (en) 2020-06-23

Family

ID=71167590

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010167523.0A Pending CN111326160A (en) 2020-03-11 2020-03-11 Speech recognition method, system and storage medium for correcting noise text

Country Status (1)

Country Link
CN (1) CN111326160A (en)



Patent Citations (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010014859A1 (en) * 1999-12-27 2001-08-16 International Business Machines Corporation Method, apparatus, computer system and storage medium for speech recongnition
JP2007155985A (en) * 2005-12-02 2007-06-21 Mitsubishi Heavy Ind Ltd Robot and voice recognition device, and method for the same
CN101075435A (en) * 2007-04-19 2007-11-21 深圳先进技术研究院 Intelligent chatting system and its realizing method
CN105609107A (en) * 2015-12-23 2016-05-25 北京奇虎科技有限公司 Text processing method and device based on voice identification
CN106023993A (en) * 2016-07-29 2016-10-12 西安旭天电子科技有限公司 Robot control system based on natural language and control method thereof
CN106910501A (en) * 2017-02-27 2017-06-30 腾讯科技(深圳)有限公司 Text entities extracting method and device
CN108538286A (en) * 2017-03-02 2018-09-14 腾讯科技(深圳)有限公司 A kind of method and computer of speech recognition
CN107146614A (en) * 2017-04-10 2017-09-08 北京猎户星空科技有限公司 A kind of audio signal processing method, device and electronic equipment
CN107204184A (en) * 2017-05-10 2017-09-26 平安科技(深圳)有限公司 Audio recognition method and system
CN109961781A (en) * 2017-12-22 2019-07-02 深圳市优必选科技有限公司 Voice messaging method of reseptance, system and terminal device based on robot
CN108304385A (en) * 2018-02-09 2018-07-20 叶伟 A kind of speech recognition text error correction method and device
WO2019153996A1 (en) * 2018-02-09 2019-08-15 叶伟 Text error correction method and apparatus for voice recognition
CN108491384A (en) * 2018-03-15 2018-09-04 周慧祥 A kind of auxiliary writing system of patent application document
CN108804414A (en) * 2018-05-04 2018-11-13 科沃斯商用机器人有限公司 Text modification method, device, smart machine and readable storage medium storing program for executing
CN109035001A (en) * 2018-06-28 2018-12-18 招联消费金融有限公司 Intelligent voice response processing method and system
CN109166575A (en) * 2018-07-27 2019-01-08 百度在线网络技术(北京)有限公司 Exchange method, device, smart machine and the storage medium of smart machine
CN110858488A (en) * 2018-08-24 2020-03-03 阿里巴巴集团控股有限公司 Voice activity detection method, device, equipment and storage medium
CN109255113A (en) * 2018-09-04 2019-01-22 郑州信大壹密科技有限公司 Intelligent critique system
CN109584882A (en) * 2018-11-30 2019-04-05 南京天溯自动化控制系统有限公司 A kind of optimization method and system of the speech-to-text for special scenes
CN109783648A (en) * 2018-12-28 2019-05-21 北京声智科技有限公司 A method of ASR language model is improved using ASR recognition result
CN110085258A (en) * 2019-04-02 2019-08-02 深圳Tcl新技术有限公司 A kind of method, system and readable storage medium storing program for executing improving far field phonetic recognization rate
CN110197661A (en) * 2019-05-31 2019-09-03 平果科力屋智能科技有限公司 One kind having both passive response and active voice interactive control equipment
CN110347823A (en) * 2019-06-06 2019-10-18 平安科技(深圳)有限公司 Voice-based user classification method, device, computer equipment and storage medium
CN110457466A (en) * 2019-06-28 2019-11-15 谭浩 Generate method, computer readable storage medium and the terminal device of interview report
CN110826301A (en) * 2019-09-19 2020-02-21 厦门快商通科技股份有限公司 Punctuation mark adding method, system, mobile terminal and storage medium
CN110705291A (en) * 2019-10-10 2020-01-17 青岛科技大学 Word segmentation method and system for documents in ideological and political education field based on unsupervised learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YU Jie: "Research on Definition Extraction Based on Spark and DN-gram Models", Journal of Beijing Information Science & Technology University (Natural Science Edition), no. 04 *
SUN Xingdong; LI Aiping; LI Shudong: "Research and Implementation of a Clustering-Based Keyword Extraction Method for Microblogs", Netinfo Security, no. 12 *
WANG Heqin; WANG Yang: "A Hybrid Very-Short-Text Classification Model Based on Sentiment Orientation and SVM", Bulletin of Science and Technology, no. 08 *
RUAN Guangce et al.: "Applied Research on Topic Identification Oriented to Term Weighting", Information Studies: Theory & Application, vol. 42, no. 12, pages 2-3 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112259100A (en) * 2020-09-15 2021-01-22 科大讯飞华南人工智能研究院(广州)有限公司 Speech recognition method, training method of related model, related equipment and device
CN112259100B (en) * 2020-09-15 2024-04-09 科大讯飞华南人工智能研究院(广州)有限公司 Speech recognition method, training method of related model, related equipment and device
CN112232055A (en) * 2020-10-28 2021-01-15 中国电子科技集团公司第二十八研究所 Text detection and correction method based on pinyin similarity and language model
CN117316165A (en) * 2023-11-27 2023-12-29 深圳云盈网络科技有限公司 Conference audio analysis processing method and system based on time sequence
CN117316165B (en) * 2023-11-27 2024-02-20 深圳云盈网络科技有限公司 Conference audio analysis processing method and system based on time sequence

Similar Documents

Publication Publication Date Title
CN110148416B (en) Speech recognition method, device, equipment and storage medium
CN110096570B (en) Intention identification method and device applied to intelligent customer service robot
CN107515877B (en) Sensitive subject word set generation method and device
CN109509470B (en) Voice interaction method and device, computer readable storage medium and terminal equipment
EP3617946B1 (en) Context acquisition method and device based on voice interaction
CN111326160A (en) Speech recognition method, system and storage medium for correcting noise text
CN112102850B (en) Emotion recognition processing method and device, medium and electronic equipment
JP2005084681A (en) Method and system for semantic language modeling and reliability measurement
CN114580382A (en) Text error correction method and device
JP2012159596A (en) Information processor, information processing method, information processing system and program
CN111897930A (en) Automatic question answering method and system, intelligent device and storage medium
CN108595406B (en) User state reminding method and device, electronic equipment and storage medium
CN112487824B (en) Customer service voice emotion recognition method, device, equipment and storage medium
CN108345612A (en) A kind of question processing method and device, a kind of device for issue handling
CN114020906A (en) Chinese medical text information matching method and system based on twin neural network
CN112632248A (en) Question answering method, device, computer equipment and storage medium
CN110717021A (en) Input text and related device for obtaining artificial intelligence interview
CN110473543B (en) Voice recognition method and device
CN114595692A (en) Emotion recognition method, system and terminal equipment
Morita et al. Superregular grammars do not provide additional explanatory power but allow for a compact analysis of animal song
CN116304046A (en) Dialogue data processing method and device, storage medium and electronic equipment
CN110827807B (en) Voice recognition method and system
CN112632234A (en) Human-computer interaction method and device, intelligent robot and storage medium
CN111554269A (en) Voice number taking method, system and storage medium
CN111046143A (en) Method for identifying Japanese conversation intention of user in ChatBot system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination