CN112185351A - Voice signal processing method and device, electronic equipment and storage medium - Google Patents

Voice signal processing method and device, electronic equipment and storage medium

Info

Publication number
CN112185351A
CN112185351A (application number CN201910606001.3A)
Authority
CN
China
Prior art keywords
corpus
recognition result
temporary recognition
temporary
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910606001.3A
Other languages
Chinese (zh)
Inventor
李思达
韩伟
王阳阳
李曙光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Orion Star Technology Co Ltd
Original Assignee
Beijing Orion Star Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Orion Star Technology Co Ltd filed Critical Beijing Orion Star Technology Co Ltd
Priority to CN201910606001.3A priority Critical patent/CN112185351A/en
Publication of CN112185351A publication Critical patent/CN112185351A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 Speech to text systems

Abstract

The invention relates to the technical field of artificial intelligence, and discloses a voice signal processing method and device, an electronic device and a storage medium. The method includes: performing speech recognition on audio stream data collected by a smart device in real time to obtain a temporary recognition result; determining a corresponding corpus set according to at least one temporary recognition result, where the corpus set includes at least one corpus; and, if any temporary recognition result matches any corpus in the corpus set, determining the matched corpus as the predicted text of that temporary recognition result. The technical solution provided by the embodiments of the invention improves the efficiency of text prediction and shortens the response time of the smart device.

Description

Voice signal processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for processing a voice signal, an electronic device, and a storage medium.
Background
With the rapid development of science and technology, smart devices now have strong processing capabilities, so that to a certain extent they can understand natural language like humans and achieve human-computer interaction. An important link in natural language processing is semantic recognition. Existing speech signal processing methods are usually implemented based on a fixed corpus: based on the speech recognition result corresponding to the speech data input by a user, corresponding corpora are obtained from the corpus, and the semantic recognition result is determined based on the obtained corpora. However, the corpus contains a large number of corpora, which leads to low matching efficiency and slow semantic recognition; as a result, the response time of the smart device is prolonged, the user cannot get a timely response, and the user experience is degraded.
Disclosure of Invention
Embodiments of the present invention provide a voice signal processing method and device, an electronic device and a storage medium, aiming to solve the problem in the prior art that the response time of a smart device is long because semantic recognition is slow.
In a first aspect, an embodiment of the present invention provides a speech signal processing method, including:
carrying out voice recognition on audio stream data acquired by intelligent equipment in real time to obtain a temporary recognition result;
determining a corresponding corpus set according to at least one temporary recognition result, wherein the corpus set comprises at least one corpus;
and if any subsequent temporary recognition result matches any corpus in the corpus set, determining the matched corpus as the predicted text of that temporary recognition result.
Optionally, the determining, according to the at least one temporary recognition result, a corresponding corpus set specifically includes:
selecting, from the corpus, candidate corpora that match the temporary recognition result to obtain the corpus set.
Optionally, selecting the candidate corpora that match the temporary recognition result from the corpus to obtain the corpus set specifically includes:
if the number of candidate corpora matching the temporary recognition result exceeds a first preset number, ranking the candidate corpora according to the text length of each candidate corpus, and selecting the first preset number of top-ranked candidate corpora to obtain the corpus set;
or, if the number of candidate corpora matching the temporary recognition result exceeds the first preset number, ranking the candidate corpora according to the number of times each candidate corpus has been hit, and selecting the first preset number of top-ranked candidate corpora to obtain the corpus set.
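As an illustration of the two ranking options above, the following Python sketch caps the corpus set at the first preset number by ranking the matching candidate corpora either by text length or by hit count. The function name, the prefix-based matching rule, the sort direction and the example data are assumptions made for this sketch, not details fixed by the embodiment.

```python
from typing import Dict, List

def select_corpus_set(temp_result: str,
                      corpus: List[str],
                      hit_counts: Dict[str, int],
                      first_preset_number: int = 5,
                      rank_by: str = "length") -> List[str]:
    """Select a corpus set for a temporary recognition result (illustrative sketch).

    Assumption: a candidate "matches" when it starts with the temporary result;
    the embodiment leaves the exact matching rule open.
    """
    candidates = [c for c in corpus if c.startswith(temp_result)]
    if len(candidates) <= first_preset_number:
        return candidates
    if rank_by == "length":
        # Rank by text length (shorter corpora first; the direction is an assumption).
        candidates.sort(key=len)
    else:
        # Rank by how often each candidate corpus has been hit before.
        candidates.sort(key=lambda c: hit_counts.get(c, 0), reverse=True)
    return candidates[:first_preset_number]

# Example usage with a toy corpus.
corpus = ["introduce blue and white porcelain", "introduce Beijing cuisine",
          "introduce Qingdao", "introduce Qingdao food"]
hits = {"introduce Qingdao": 12, "introduce Beijing cuisine": 3}
print(select_corpus_set("introduce", corpus, hits, 2, rank_by="hits"))
```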
Optionally, if any subsequent temporary recognition result matches any corpus in the corpus set, determining the matched corpus as the predicted text of that temporary recognition result specifically includes:
if the next temporary recognition result is consistent with any corpus in the corpus set, determining that corpus as the predicted text of the next temporary recognition result.
Optionally, the method further comprises: if the next temporary recognition result is not consistent with any of the corpora in the corpus set, re-determining the corpus set according to the next temporary recognition result.
Optionally, the re-determining the corpus set according to the next temporary recognition result specifically includes:
selecting a candidate corpus matched with the next temporary recognition result from the corpus to obtain a first candidate set;
and selecting candidate corpora different from the corpora contained in the previously determined corpus set from the first candidate set, and adding the selected candidate corpora to the corpus set.
Optionally, the determining, according to the at least one temporary recognition result, a corresponding corpus set specifically includes:
determining the corresponding corpus set according to the feature words corresponding to each corpus set contained in the corpus and the at least one temporary recognition result, wherein corpora containing the same feature word are grouped into the same corpus set.
Optionally, the determining, according to the feature words and the at least one temporary recognition result corresponding to each corpus set included in the corpus, a corresponding corpus set specifically includes:
if the feature word corresponding to any corpus set is consistent with at least part of the text contained in the current temporary recognition result, determining that corpus set as the corpus set corresponding to the current temporary recognition result;
or, if the similarity between the feature word corresponding to any corpus set and the current temporary recognition result is higher than a first threshold, determining that corpus set as the corpus set corresponding to the current temporary recognition result.
Optionally, the determining, according to the feature words and the at least one temporary recognition result corresponding to each corpus set included in the corpus, a corresponding corpus set specifically includes:
determining an invalid text in the temporary recognition result according to the matching result of the at least one temporary recognition result and the feature words;
and determining the corresponding corpus set according to the feature words corresponding to each corpus set and the valid text, excluding the invalid text, in the at least one temporary recognition result.
Optionally, the determining, according to the at least one temporary recognition result and the matching result of the feature word, an invalid text in the current temporary recognition result specifically includes:
if a first feature word matched by the current temporary recognition result is different from a second feature word matched by the previous temporary recognition result, and the similarity between the first feature word and the current temporary recognition result is higher than the similarity between the second feature word and the previous temporary recognition result, determining that the previous temporary recognition result is an invalid text;
or, if the current temporary recognition result contains a preset high-frequency word, and the similarity corresponding to the first feature word matched by the current temporary recognition result is higher than the similarity corresponding to the second feature word matched by the previous temporary recognition result, determining that the text preceding the high-frequency word in the current temporary recognition result is an invalid text.
Optionally, it is determined that any subsequent temporary recognition result matches any corpus in the corpus set in the following manner:
determining the corpus matched by any subsequent temporary recognition result according to the similarity between that temporary recognition result and each corpus in the corpus set; or
if any corpus in the corpus set contains any subsequent temporary recognition result, determining that corpus as the corpus matched by that temporary recognition result.
Optionally, the determining, according to the similarity between any subsequent temporary recognition result and each corpus in the corpus set, the corpus matched by that temporary recognition result specifically includes:
if the difference between the similarity of a first corpus and the similarity of a second corpus is greater than a preset difference, determining the first corpus as the corpus matched by the temporary recognition result, where the first corpus is the corpus in the corpus set with the highest similarity to the temporary recognition result, and the second corpus is the corpus in the corpus set with the second highest similarity to the temporary recognition result;
or, if the corpus with the highest similarity to each of multiple adjacent temporary recognition results is the same first corpus, and the similarity between the multiple temporary recognition results and the first corpus first increases and then decreases, determining the first corpus as the corpus matched by a first temporary recognition result, where the first temporary recognition result is the temporary recognition result, among the multiple temporary recognition results, with the highest similarity to the first corpus.
Optionally, the method further comprises:
adding a truncation marker after the recognized text included in the temporary recognition result, where the recognized text is the temporary recognition result for which the corresponding predicted text has been determined;
and determining the corresponding corpus set according to the text following the truncation marker in the at least one temporary recognition result.
Optionally, before determining the corresponding corpus set according to at least one temporary recognition result for the first time, the method further includes:
and determining that the number of characters contained in the temporary recognition result exceeds a second preset number.
Optionally, the method further comprises:
and if a speech end point is detected in the audio stream data collected by the smart device in real time, clearing the obtained temporary recognition results, and returning to the step of performing speech recognition on the audio stream data collected by the smart device in real time to obtain a temporary recognition result.
In a second aspect, an embodiment of the present invention provides a speech signal processing apparatus, including:
the voice recognition module is used for carrying out voice recognition on audio stream data acquired by the intelligent equipment in real time to obtain a temporary recognition result;
the determining module is used for determining a corresponding corpus set according to at least one temporary recognition result, wherein the corpus set comprises at least one corpus;
and the prediction module is used for determining the matched corpus as the predicted text of a temporary recognition result if any subsequent temporary recognition result matches any corpus in the corpus set.
Optionally, the determining module is specifically configured to:
selecting, from the corpus, candidate corpora that match the temporary recognition result to obtain the corpus set.
Optionally, the determining module is specifically configured to:
if the number of candidate corpora matching the temporary recognition result exceeds a first preset number, ranking the candidate corpora according to the text length of each candidate corpus, and selecting the first preset number of top-ranked candidate corpora to obtain the corpus set;
or, if the number of candidate corpora matching the temporary recognition result exceeds the first preset number, ranking the candidate corpora according to the number of times each candidate corpus has been hit, and selecting the first preset number of top-ranked candidate corpora to obtain the corpus set.
Optionally, the prediction module is specifically configured to:
and if the next temporary recognition result is consistent with any one corpus in the corpus set, determining the corpus as a predicted text of the next temporary recognition result.
Optionally, the determining module is further configured to:
if the next temporary recognition result is not consistent with any of the corpora in the corpus set, re-determining the corpus set according to the next temporary recognition result.
Optionally, the determining module is specifically configured to:
selecting a candidate corpus matched with the next temporary recognition result from the corpus to obtain a first candidate set;
and selecting candidate corpora different from the corpora contained in the previously determined corpus set from the first candidate set, and adding the selected candidate corpora to the corpus set.
Optionally, the determining module is specifically configured to:
determining the corresponding corpus set according to the feature words corresponding to each corpus set contained in the corpus and the at least one temporary recognition result, wherein corpora containing the same feature word are grouped into the same corpus set.
Optionally, the determining module is specifically configured to:
if the feature word corresponding to any corpus set is consistent with at least part of the text contained in the current temporary recognition result, determining that corpus set as the corpus set corresponding to the current temporary recognition result;
or, if the similarity between the feature word corresponding to any corpus set and the current temporary recognition result is higher than a first threshold, determining that corpus set as the corpus set corresponding to the current temporary recognition result.
Optionally, the determining module is specifically configured to:
determining an invalid text in the temporary recognition result according to the matching result of the at least one temporary recognition result and the feature words;
and determining the corresponding corpus set according to the feature words corresponding to each corpus set and the valid text, excluding the invalid text, in the at least one temporary recognition result.
Optionally, the determining module is specifically configured to:
if a first feature word matched by the current temporary recognition result is different from a second feature word matched by the previous temporary recognition result, and the similarity between the first feature word and the current temporary recognition result is higher than the similarity between the second feature word and the previous temporary recognition result, determining that the previous temporary recognition result is an invalid text;
or, if the current temporary recognition result contains a preset high-frequency word, and the similarity corresponding to the first feature word matched by the current temporary recognition result is higher than the similarity corresponding to the second feature word matched by the previous temporary recognition result, determining that the text preceding the high-frequency word in the current temporary recognition result is an invalid text.
Optionally, the prediction module is specifically configured to determine, in the following manner, that any subsequent temporary recognition result matches any corpus in the corpus set:
determining the corpus matched by any subsequent temporary recognition result according to the similarity between that temporary recognition result and each corpus in the corpus set;
or, if any corpus in the corpus set contains any subsequent temporary recognition result, determining that corpus as the corpus matched by that temporary recognition result.
Optionally, the prediction module is specifically configured to:
if the difference between the similarity of a first corpus and the similarity of a second corpus is greater than a preset difference, determining the first corpus as the corpus matched by the temporary recognition result, where the first corpus is the corpus in the corpus set with the highest similarity to the temporary recognition result, and the second corpus is the corpus in the corpus set with the second highest similarity to the temporary recognition result;
or, if the corpus with the highest similarity to each of multiple adjacent temporary recognition results is the same first corpus, and the similarity between the multiple temporary recognition results and the first corpus first increases and then decreases, determining the first corpus as the corpus matched by a first temporary recognition result, where the first temporary recognition result is the temporary recognition result, among the multiple temporary recognition results, with the highest similarity to the first corpus.
Optionally, the determining module is further configured to:
adding a truncation marker after the recognized text included in the temporary recognition result, where the recognized text is the temporary recognition result for which the corresponding predicted text has been determined;
and determining the corresponding corpus set according to the text following the truncation marker in the at least one temporary recognition result.
Optionally, the determining module is further configured to:
and determining that the number of characters contained in the temporary recognition result exceeds a second preset number before determining the corresponding corpus set according to at least one temporary recognition result for the first time.
Optionally, the apparatus further comprises an emptying module, configured to:
if a speech end point is detected in the audio stream data collected by the smart device in real time, clearing the obtained temporary recognition results and returning to execute the function of the speech recognition module.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of any one of the methods when executing the computer program.
In a fourth aspect, an embodiment of the invention provides a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of any of the methods described above.
In a fifth aspect, an embodiment of the invention provides a computer program product comprising a computer program stored on a computer-readable storage medium, the computer program comprising program instructions which, when executed by a processor, implement the steps of any of the methods described above.
According to the technical solution provided by the embodiments of the present invention, a corpus set matching the temporary recognition result can be determined from the temporary recognition results obtained in real time, which narrows the matching range; then, as speech recognition proceeds and a more complete temporary recognition result is obtained, the predicted text corresponding to the current temporary recognition result can be determined from that corpus set. This greatly reduces the amount of data that needs to be matched during text prediction, improves the efficiency of text prediction and hence of semantic recognition, enables the smart device to respond to user input in real time, and improves the user experience. In addition, because speech recognition and text prediction are performed synchronously, text prediction can be substantially completed by the time speech recognition finishes, which further improves text prediction efficiency.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic view of an application scenario of a speech signal processing method according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating a speech signal processing method according to an embodiment of the present invention;
fig. 3 is a flowchart illustrating a speech signal processing method according to an embodiment of the invention;
fig. 4 is a flowchart illustrating a speech signal processing method according to an embodiment of the invention;
fig. 5 is a flowchart illustrating a speech signal processing method according to an embodiment of the invention;
fig. 6 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
For convenience of understanding, terms referred to in the embodiments of the present invention are explained below:
the phrase is a unit of language without sentence tone combined by three language units which can be matched on the three levels of syntax, semantics and language, and is called a phrase. The phrase is larger than the word and is not a grammatical unit of a sentence, a simple phrase can serve as a syntactic component of a complex phrase, and the phrase and the sentence tone can become a sentence. Generally, the phrase includes at least one word (word), for example, the phrase may be "introduction", "Beijing cuisine", etc.
A morpheme is the smallest language unit that combines sound and meaning; that is, to be called a morpheme, a language unit must simultaneously satisfy three conditions, namely "smallest", "having sound" and "having meaning", the key conditions being "smallest" and "having meaning".
Real-time speech transcription (Real-time ASR) is based on a deep full-sequence convolutional neural network framework. A long connection is established between the application and the speech transcription core engine through the WebSocket protocol, so that audio stream data can be converted into a character stream in real time and text is generated while the user is still speaking; the recognized temporary recognition result is generally output with the morpheme as the minimum unit. For example, if the collected audio stream corresponds to the syllables of "how is the weather today" spoken one by one, the syllables are recognized in the order of the audio stream: the temporary recognition result "today" is output first, then the temporary recognition result "today's weather", and so on, until the whole audio stream has been recognized and the final recognition result "how is the weather today" is obtained. Real-time speech transcription can also intelligently correct previously output temporary recognition results based on the subsequent audio stream and a semantic understanding of the context, so as to guarantee the accuracy of the final recognition result. That is, the temporary recognition result output in real time from the audio stream changes continuously over time: for example, the first output may be "gold" (which sounds like "today" in Chinese) and is corrected to "today" in the second output; the third output may be "today field" (which sounds like the beginning of "today's weather") and is corrected to "today's weather" in the fourth output; and so on, until the accurate final recognition result is obtained through continuous recognition and correction.
Voice Activity Detection (VAD), also called Voice endpoint Detection, refers to detecting the existence of Voice in a noise environment, and is generally used in Voice processing systems such as Voice coding and Voice enhancement, and plays roles of reducing a Voice coding rate, saving a communication bandwidth, reducing energy consumption of a mobile device, improving a recognition rate, and the like. A representative VAD method of the prior art is ITU-T G.729Annex B. At present, a voice activity detection technology is widely applied to a voice recognition process, and a part of a segment of audio that really contains user voice is detected through the voice activity detection technology, so that a mute part of the audio is eliminated, and only the part of the audio that contains the user voice is recognized.
Any number of elements in the drawings are by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
The existing semantic recognition method is usually implemented based on a fixed corpus: based on the speech recognition result corresponding to the speech data input by the user, a corresponding corpus is obtained from the corpus as the predicted text, and the semantic recognition result is then obtained based on the predicted text. However, the corpus contains a large number of corpora, which lowers the matching efficiency, prolongs the response time of the smart device, prevents the user from getting a timely reply, and degrades the user experience. In addition, with the development of speech recognition technology, real-time speech transcription is now available: continuously input audio stream data is converted into a character stream in real time, without waiting for the user to finish a complete utterance before generating the corresponding text from the whole utterance. However, most temporary recognition results are intermediate results rather than final speech recognition results, such as "today" or "today's weather"; based on these intermediate results it is difficult to match suitable corpora from the corpus, and if matching in the corpus is performed only after the final speech recognition result has been obtained, the high processing efficiency brought by real-time speech transcription is obviously wasted. Therefore, improving the processing efficiency of semantic recognition is highly desirable.
In view of this, the inventors of the present invention propose: performing speech recognition on audio stream data collected by a smart device in real time to obtain temporary recognition results; determining a corresponding corpus set according to at least one temporary recognition result, the corpus set including at least one corpus; and, if any subsequent temporary recognition result matches any corpus in the corpus set, determining the matched corpus as the predicted text of that temporary recognition result. In this way, a corpus set matching the temporary recognition result can be determined from the temporary recognition results obtained in real time, which narrows the matching range; then, as speech recognition proceeds and a more complete temporary recognition result is obtained, the predicted text corresponding to the current temporary recognition result can be determined from that corpus set. This greatly reduces the amount of data to be matched during text prediction, and improves the efficiency of text prediction and hence of semantic recognition. In addition, because speech recognition and text prediction proceed synchronously, text prediction can be substantially completed by the time speech recognition finishes, further improving text prediction efficiency.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Fig. 1 is a schematic view of an application scenario of a speech signal processing method according to an embodiment of the present invention. During the interaction between the user 10 and the smart device 11, the smart device 11 continuously collects ambient sounds and continuously reports the ambient sounds to the server 12 in the form of voice data, where the voice data may include ambient sounds around the smart device 11 or speech sounds of other users in addition to the speech sound of the user 10. The server 12 sequentially performs voice recognition processing and semantic recognition processing on the voice data continuously reported by the intelligent device 11, determines corresponding response data according to a semantic recognition result, and controls the intelligent device 11 to output the response data so as to feed back to the user.
In this application scenario, the smart device 11 and the server 12 are communicatively connected through a network, which may be a local area network, a wide area network, or the like. The smart device 11 may be a smart speaker, a robot, or the like, a portable device (e.g., a mobile phone, a tablet, a notebook, or the like), or a Personal Computer (PC). The server 12 may be any server, a server cluster composed of several servers, or a cloud computing center capable of providing voice recognition services.
Of course, the speech recognition processing and the semantic recognition processing of the speech data, and the subsequent processing of determining the response data and the like may also be executed on the intelligent device side, and the execution subject is not limited in the embodiment of the present invention. For convenience of description, in each embodiment provided by the present invention, the speech processing is performed at the server side for example, and the process of performing the speech processing at the intelligent device side is similar to this, and is not described herein again.
The following describes a technical solution provided by an embodiment of the present invention with reference to an application scenario shown in fig. 1.
Referring to fig. 2, an embodiment of the present invention provides a speech signal processing method, applied to the server side shown in fig. 1, including the following steps:
s201, voice recognition is carried out on audio stream data collected by the intelligent device in real time, and a temporary recognition result is obtained.
In the embodiment of the present invention, after a user starts talking to the smart device, the smart device continuously collects the sound around it, converts the sound into audio stream data and sends it to the server. The server can perform speech recognition on the continuous audio stream data using technologies such as real-time speech transcription, and update the temporary recognition result in real time, where each update is performed on the basis of the previously updated temporary recognition result. It should be noted that the latest temporary recognition result may be updated in real time as the smart device uploads new audio stream data. For example, the temporary recognition result obtained at the beginning may be "gold" (which sounds like "today"); on the basis of this result, the temporary recognition result is updated from the subsequent audio stream data and may be corrected to "today"; the next update may yield "today field" (which sounds like the beginning of "today's weather"); and as updating continues based on the audio stream data, the temporary recognition result may be corrected to "today's weather".
It should be noted that, the step S201 is continuously performed, that is, as long as the intelligent device continuously uploads the acquired audio stream data to the server, the server continuously performs speech recognition on the audio stream data to obtain a new temporary recognition result.
S202, determining a corresponding corpus set according to at least one temporary recognition result, wherein the corpus set comprises at least one corpus.
In specific implementation, a large amount of corpora with complete semantics are stored in the corpus in advance, for example, "how much the weather is today", "which movies are shown recently", "introduce blue and white porcelain", and the like. The corpus set can be determined according to at least one temporary recognition result based on the corpus.
S203, if any subsequent temporary recognition result matches any corpus in the corpus set, the matched corpus is determined as the predicted text of that temporary recognition result.
In specific implementation, after the corpus set is determined according to step S202, the temporary recognition result of this time is matched with the corpus in the corpus set, and the matched corpus is determined as the prediction text of the temporary recognition result of this time.
According to the method provided by the embodiment of the present invention, a corpus set matching the temporary recognition result can be determined from the temporary recognition results obtained in real time, which narrows the matching range; then, as speech recognition proceeds and a more complete temporary recognition result is obtained, the predicted text corresponding to the current temporary recognition result can be determined from that corpus set. This greatly reduces the amount of data that needs to be matched during text prediction, improves the efficiency of text prediction and hence of semantic recognition, enables the smart device to respond to user input in real time, and improves the user experience. In addition, because speech recognition (corresponding to step S201) and text prediction (corresponding to steps S202 and S203) are performed synchronously, text prediction can be substantially completed by the time speech recognition finishes, which further improves text prediction efficiency.
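For illustration, the following is a minimal Python sketch of how steps S201 to S203 might interleave, driving text prediction from a stream of temporary recognition results. The helper names and the simple key-based and containment-based matching rules are assumptions made for this sketch rather than the embodiment's required implementation.

```python
from typing import Iterable, List, Optional

def determine_corpus_set(temp_result: str, corpus: List[str]) -> List[str]:
    # Narrow the corpus to candidates related to the temporary result
    # (here: sharing its first two characters, an assumption for this sketch).
    key = temp_result[:2]
    return [c for c in corpus if key and key in c]

def match_corpus(temp_result: str, corpus_set: List[str]) -> Optional[str]:
    # A corpus "matches" when it contains the temporary result (one of the
    # matching manners mentioned in the embodiment).
    for c in corpus_set:
        if temp_result and temp_result in c:
            return c
    return None

def predict_text(temporary_results: Iterable[str], corpus: List[str]) -> Optional[str]:
    corpus_set: List[str] = []
    for temp_result in temporary_results:        # S201: results arrive in real time
        if not corpus_set:
            corpus_set = determine_corpus_set(temp_result, corpus)   # S202
            continue
        predicted = match_corpus(temp_result, corpus_set)            # S203
        if predicted is not None:
            return predicted
    return None

# Toy run: temporary results grow as speech recognition proceeds.
corpus = ["how is the weather today", "introduce blue and white porcelain"]
stream = ["today", "today's wea", "how is the weather today"]
print(predict_text(stream, corpus))
```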
As a possible implementation, step S202 specifically includes: determining the corresponding corpus set according to the feature words corresponding to each corpus set contained in the corpus and the at least one temporary recognition result, where corpora containing the same feature word are grouped into the same corpus set.
To this end, referring to fig. 3, another speech signal processing method according to an embodiment of the present invention includes the following steps:
s301, voice recognition is carried out on audio stream data collected by the intelligent device in real time, and a temporary recognition result is obtained.
S302, determining a corresponding corpus set according to the characteristic words corresponding to each corpus set and at least one temporary recognition result, wherein the corpus sets containing the same characteristic words are divided into the same corpus set.
And S303, if any subsequent temporary recognition result is matched with any corpus in the corpus set determined in the step S302, determining the matched corpus as a predicted text of the temporary recognition result.
In the embodiment of the present invention, a feature word is a phrase contained in multiple preset corpora, and a feature word may appear at the beginning, in the middle or at the end of a corpus. For example, the corpora "introduce blue and white porcelain" and "introduce Beijing cuisine" both contain the feature word "introduce", while the corpora "introduce blue and white porcelain", "which dynasty is the blue and white porcelain from" and "when was the blue and white porcelain unearthed" all contain the feature word "blue and white porcelain".
In specific implementation, corpora having the same feature word in the corpus are grouped into the same corpus set in advance. For example, "introduce blue and white porcelain" and "introduce Beijing cuisine" both contain the same feature word "introduce", so both are placed in the corpus set corresponding to the feature word "introduce". In this way the corpora in the corpus are divided into multiple corpus sets, each corresponding to a unique feature word. It should be noted that the same corpus may be placed in multiple corpus sets; for example, "introduce blue and white porcelain" may be placed both in the corpus set corresponding to "introduce" and in the corpus set corresponding to "blue and white porcelain".
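A minimal sketch, in Python, of how corpora might be grouped into corpus sets keyed by feature words as described above; the feature-word list and the substring test are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List

def build_corpus_sets(corpus: List[str], feature_words: List[str]) -> Dict[str, List[str]]:
    """Group corpora that share a feature word into the same corpus set.

    A corpus may fall into several corpus sets if it contains several
    feature words (e.g. "introduce blue and white porcelain").
    """
    corpus_sets: Dict[str, List[str]] = defaultdict(list)
    for text in corpus:
        for word in feature_words:
            if word in text:
                corpus_sets[word].append(text)
    return dict(corpus_sets)

corpus = ["introduce blue and white porcelain", "introduce Beijing cuisine",
          "which dynasty is the blue and white porcelain from"]
feature_words = ["introduce", "blue and white porcelain"]
print(build_corpus_sets(corpus, feature_words))
```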
Referring to the speech signal processing method shown in fig. 3, when speech recognition is performed, each time a temporary recognition result is obtained, the temporary recognition result is matched with the feature words corresponding to each corpus set in the corpus until a matched corpus set (for convenience of description, it may be referred to as a target corpus set) is determined; after that, every time a temporary recognition result is obtained, the temporary recognition result is matched with the linguistic data in the target linguistic data set, and the matched linguistic data is determined as the prediction text of the temporary recognition result. Since the temporary recognition result obtained from the beginning contains fewer words, the corpus set corresponding to the temporary recognition result is matched based on the feature words, and the corpus set does not need to be matched with the complete corpus, so that the matching efficiency is improved. In the process of determining the target corpus set, the number of words contained in the temporary recognition result is gradually increased, and after the target corpus set is determined, the temporary recognition result is matched with the corpus in the target corpus set, so that the data volume needing to be matched is greatly reduced, the text prediction efficiency is improved, and the response time of the intelligent device for user input is shortened. In addition, because the voice recognition and text prediction processes are carried out synchronously, the range of the linguistic data to be matched can be gradually reduced in the voice recognition process, so that the text prediction can be basically completed synchronously while the voice recognition is completed, and the text prediction efficiency is further improved.
In specific implementation, step S302 may be implemented as follows:
in a first mode, if the feature words corresponding to any one corpus set are consistent with at least part of texts contained in the current temporary recognition result, the corpus set is determined as the corpus set corresponding to the current temporary recognition result.
Specifically, the feature words corresponding to each corpus set in the corpus are sequentially compared with the current temporary recognition result, and if a certain feature word is completely consistent with the current temporary recognition result or the feature word is consistent with a part of text in the current temporary recognition result, the corpus set corresponding to the feature word is determined as the corpus set corresponding to the current temporary recognition result, that is, the target corpus set is determined.
Further, if the corresponding corpus set is not determined based on the current temporary recognition result, the corresponding corpus set is determined based on the next temporary recognition result.
For example, if the current temporary recognition result is "introduce" and the feature word contained in the corpus is "introduce once", "introduce" is not consistent with "introduce once", so the corresponding corpus set cannot be determined based on the current temporary recognition result; if the next temporary recognition result is "introduce once", which is consistent with the feature word "introduce once", the corpus set corresponding to the feature word "introduce once" is determined as the corpus set corresponding to the next temporary recognition result.
For another example, if the current temporary recognition result is "um, introduce once", which contains the feature word "introduce once", the corpus set corresponding to the feature word "introduce once" can be determined as the corpus set corresponding to the current temporary recognition result "um, introduce once".
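The first manner can be sketched as follows: a corpus set is selected once its feature word equals the current temporary recognition result or appears as part of its text; otherwise the decision is deferred to the next temporary recognition result. The function name and data layout are assumptions for this sketch.

```python
from typing import Dict, List, Optional

def find_target_set_by_text(temp_result: str,
                            corpus_sets: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return the corpus set whose feature word is consistent with (part of)
    the temporary recognition result, or None if no set matches yet."""
    for feature_word, corpus_set in corpus_sets.items():
        if feature_word == temp_result or feature_word in temp_result:
            return corpus_set
    return None   # wait for the next, more complete temporary result

corpus_sets = {"introduce": ["introduce blue and white porcelain",
                             "introduce Beijing cuisine"]}
print(find_target_set_by_text("intr", corpus_sets))           # None: keep waiting
print(find_target_set_by_text("um, introduce", corpus_sets))  # matched corpus set
```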
And in the second mode, if the similarity between the characteristic words corresponding to any one corpus set and the temporary recognition result is higher than the first threshold, determining the corpus set as the corpus set corresponding to the temporary recognition result.
In specific implementation, the similarity between the current temporary recognition result and the feature word may be calculated by any existing similarity algorithm, for example, a text similarity algorithm, an euclidean distance, or the like.
In the embodiment of the present invention, the first threshold may be determined by those skilled in the art according to the accuracy of the selected similarity algorithm and the actual requirement, and is not particularly limited.
Specifically, the similarity between the feature words corresponding to each corpus set in the corpus and the temporary recognition result of this time is sequentially calculated, and if the similarity between a certain feature word and the temporary recognition result of this time is higher than a first threshold, the corpus set corresponding to the feature word is determined as the corpus set corresponding to the temporary recognition result of this time. If the similarity between a plurality of feature words and the temporary recognition result is higher than the first threshold, the corpus sets corresponding to the feature words may all be determined as the corpus set corresponding to the temporary recognition result. Of course, if the similarity between a plurality of feature words and the current temporary recognition result is higher than the first threshold, the corpus set corresponding to the feature word with the highest similarity may also be selected and determined as the corpus set corresponding to the current temporary recognition result.
Further, if the corresponding corpus set is not determined based on the current temporary recognition result, the corresponding corpus set is determined based on the next temporary recognition result.
For example, if the current temporary recognition result is "introduce" and its similarity to the feature word "introduce once" is lower than the first threshold, the corresponding corpus set cannot be determined based on the current temporary recognition result; if the next temporary recognition result is "introduce once" and its similarity to the feature word "introduce once" is higher than the first threshold, the corpus set corresponding to the feature word "introduce once" is determined as the corpus set corresponding to the next temporary recognition result.
For example, assume that the corpus contains only the corpus set corresponding to the feature word "introduce once", and the current temporary recognition result is a slightly different expression such as "introduce". If the first threshold is set reasonably, the similarity between the temporary recognition result and the feature word "introduce once" can still be determined to be higher than the first threshold, and the corpus set corresponding to the feature word "introduce once" is then used as the corpus set corresponding to the temporary recognition result. In this way, the generalization capability of feature-word matching is improved, so that the method can adapt to users' various ways of expressing the same request.
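A sketch of the second manner: the corpus set whose feature word is most similar to the current temporary recognition result is selected, provided the similarity exceeds the first threshold. The character-overlap similarity below is only a stand-in for whatever text-similarity algorithm is actually chosen, and the threshold value is an assumption.

```python
from typing import Dict, List, Optional

def similarity(a: str, b: str) -> float:
    """Toy similarity: character-overlap ratio, a stand-in for whatever
    text-similarity algorithm (edit distance, embeddings, ...) is chosen."""
    return len(set(a) & set(b)) / max(len(set(a) | set(b)), 1)

def find_target_set_by_similarity(temp_result: str,
                                  corpus_sets: Dict[str, List[str]],
                                  first_threshold: float = 0.6) -> Optional[List[str]]:
    """Return the corpus set of the most similar feature word above the threshold."""
    best_word, best_score = None, 0.0
    for feature_word in corpus_sets:
        score = similarity(feature_word, temp_result)
        if score > first_threshold and score > best_score:
            best_word, best_score = feature_word, score
    return corpus_sets[best_word] if best_word is not None else None
```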
In practical application, because users' spoken language is not always well-formed or contains filler words, some content without semantics appears in the speech recognition result corresponding to the audio stream data input by the user. Especially when such content appears at the beginning of a sentence, it increases the difficulty of matching a feature word, prolongs the time required for text prediction, and may even prevent an accurate predicted text from being obtained. For example, the speech input by the user may be "um, uh, introduce Qingdao" or "oh, I want to, er, introduce Qingdao", where "um, uh" and "oh, I want to, er" are content without semantics, and obviously no corresponding feature word can be matched based on such content.
For this reason, as shown in fig. 4, on the basis of any of the above embodiments, step S302 can also be implemented by:
s401, determining an invalid text in the temporary recognition result according to the matching result of the at least one temporary recognition result and the feature words.
S402, determining the corresponding corpus set according to the feature words corresponding to each corpus set and the effective text without the invalid text in the at least one temporary recognition result.
In specific implementation, step S401 may be implemented as follows:
in the first mode, if the first feature word matched with the current temporary recognition result is different from the second feature word matched with the last temporary recognition result, and the similarity between the first feature word and the current temporary recognition result is higher than the similarity between the second feature word and the last temporary recognition result, the last temporary recognition result is determined to be an invalid text.
In a specific implementation, the first feature word may be a feature word with the highest similarity to the current temporary recognition result, and the second feature word may be a feature word with the highest similarity to the last temporary recognition result.
For example, suppose the text corresponding to the audio stream data collected by the smart device is "um, look, um, introduce blue and white porcelain". The second feature word matched by the previous temporary recognition result "um, look" is "check", and the similarity between "um, look" and "check" is low; the first feature word matched by the current temporary recognition result "um, look, um, introduce" is "introduce", and the similarity between the first feature word and the current temporary recognition result is higher than the similarity between the second feature word and the previous temporary recognition result, so the previous temporary recognition result "um, look" can be determined to be an invalid text. After the invalid text is determined, when the temporary recognition result is "um, look, um, introduce", the valid text in it is determined to be "introduce", and the corresponding corpus set is determined according to the feature words corresponding to each corpus set and the valid text "introduce".
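A sketch of the first manner of detecting invalid text: when the feature word matched by the current temporary recognition result differs from the one matched by the previous result and has a higher similarity, the previous temporary recognition result is treated as invalid text. The toy similarity function and helper names are assumptions for this sketch.

```python
from typing import List, Optional, Tuple

def similarity(a: str, b: str) -> float:
    """Toy character-overlap similarity (stand-in for a real text-similarity algorithm)."""
    return len(set(a) & set(b)) / max(len(set(a) | set(b)), 1)

def best_feature_word(temp_result: str,
                      feature_words: List[str]) -> Tuple[Optional[str], float]:
    """Return the feature word with the highest similarity to the temporary result."""
    best, best_score = None, 0.0
    for word in feature_words:
        score = similarity(word, temp_result)
        if score > best_score:
            best, best_score = word, score
    return best, best_score

def invalid_text_mode1(prev_result: str, curr_result: str,
                       feature_words: List[str]) -> Optional[str]:
    """Return the invalid text (the previous temporary result), or None."""
    second_word, second_score = best_feature_word(prev_result, feature_words)
    first_word, first_score = best_feature_word(curr_result, feature_words)
    if first_word != second_word and first_score > second_score:
        return prev_result   # e.g. "um, look" is discarded as invalid text
    return None
```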
And in the second mode, if the temporary recognition result contains preset high-frequency words and the similarity corresponding to the first characteristic words matched with the temporary recognition result is higher than the similarity corresponding to the second characteristic words matched with the temporary recognition result at the last time, determining that the text before the high-frequency words contained in the temporary recognition result is invalid text.
The high-frequency words in the embodiment of the present invention may be set artificially according to actual requirements, for example, words frequently used in an application scenario of an intelligent device, or determined by counting high-frequency words appearing in a corpus.
For example, suppose again that the text corresponding to the audio stream data collected by the smart device is "um, look, um, introduce blue and white porcelain". When the temporary recognition result is "um, look, um", it contains no high-frequency word; when the temporary recognition result is "um, look, um, introduce", it can be recognized that it contains the preset high-frequency word "introduce", and the similarity corresponding to the feature word "introduce" matched by the current temporary recognition result is obviously higher than the similarity corresponding to the feature word "check" matched by the previous temporary recognition result, so the text "um, look, um" before the high-frequency word "introduce" in the current temporary recognition result can be determined to be an invalid text. Thereafter, text prediction, semantic recognition and other processing may be performed based on the valid text that follows the invalid text in each newly obtained temporary recognition result; for example, when the temporary recognition result is "um, look, um, introduce blue and white porcelain", text prediction and semantic recognition are performed on the valid text "introduce blue and white porcelain" that follows the invalid text "um, look, um".
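A sketch of the second manner: when the current temporary recognition result contains a preset high-frequency word and the newly matched feature word has a higher similarity than the previously matched one, the text before the high-frequency word is treated as invalid. The high-frequency word list and function signature are assumptions for this sketch; the similarity scores are assumed to come from whichever similarity function is used for feature-word matching.

```python
from typing import Optional

HIGH_FREQUENCY_WORDS = ["introduce", "play", "search"]   # assumed, set per application

def invalid_text_mode2(prev_score: float, curr_score: float,
                       curr_result: str) -> Optional[str]:
    """Return the invalid prefix of curr_result, or None if nothing is invalid.

    prev_score / curr_score are the similarities of the feature words matched by
    the previous and the current temporary recognition results respectively.
    """
    for word in HIGH_FREQUENCY_WORDS:
        pos = curr_result.find(word)
        if pos > 0 and curr_score > prev_score:
            return curr_result[:pos]   # e.g. "um, look, um, " before "introduce"
    return None

# The valid text used for later matching is curr_result with this prefix removed,
# e.g. "introduce blue and white porcelain".
```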
For convenience of description, the corresponding corpus set determined in step S302 is referred to as a target corpus set.
On the basis of any of the above embodiments, the corpus matching the temporary recognition result may be determined from the target corpus set in the following manner: determining, according to the similarity between any temporary recognition result obtained after step S302 is executed and each corpus in the target corpus set, the corpus matched by that temporary recognition result.
In specific implementation, the similarity between the temporary recognition result and the corpus in the target corpus set may be calculated by any existing similarity algorithm, for example, a text similarity algorithm, an euclidean distance, or the like.
In specific implementation, based on the similarity between the temporary recognition result and the corpus in the target corpus set, the corpus matched with the temporary recognition result can be determined in the following manner:
In the first mode, if the difference between the similarity of the first corpus and the similarity of the second corpus is greater than a preset difference, the first corpus is determined as the corpus matching the current temporary recognition result, where the first corpus is the corpus in the target corpus set with the highest similarity to the current temporary recognition result, and the second corpus is the corpus in the target corpus set with the second highest similarity to the current temporary recognition result.
In specific implementation, if the difference between the similarity of the first corpus and the similarity of the second corpus is not greater than the preset difference, the step S303 is continuously performed to determine the predicted text based on the next temporary recognition result.
In the embodiment of the present invention, the preset difference may be determined by a person skilled in the art according to the accuracy of the selected similarity algorithm and the actual requirement, and is not particularly limited.
For example, suppose the current temporary recognition result is "introduce Qing" ("Qing" being the first character, in Chinese, of both "Qingdao" and "blue and white porcelain"), and the target corpus set corresponding to "introduce" contains the corpora "introduce Qingdao" and "introduce blue and white porcelain". The corpus with the highest similarity to the current temporary recognition result "introduce Qing" is "introduce Qingdao", and the corpus with the second highest similarity is "introduce blue and white porcelain"; the difference between the similarity corresponding to the first corpus "introduce Qingdao" and the similarity corresponding to the second corpus "introduce blue and white porcelain" is not greater than the preset difference, so the corpus matching the current temporary recognition result cannot yet be determined, and the predicted text is determined based on the next temporary recognition result. If the next temporary recognition result is "introduce Qingdao", the corpus with the highest similarity to it is "introduce Qingdao" and the corpus with the second highest similarity is "introduce blue and white porcelain"; now the difference between the similarity corresponding to the first corpus "introduce Qingdao" and the similarity corresponding to the second corpus "introduce blue and white porcelain" is greater than the preset difference, so "introduce Qingdao" can be determined as the corpus matched by the current temporary recognition result.
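A sketch of the first manner: the best-matching corpus is accepted only when its similarity exceeds that of the second-best corpus by more than the preset difference; otherwise the decision is deferred to the next temporary recognition result. The toy similarity function and the preset difference value are assumptions for this sketch.

```python
from typing import List, Optional

def similarity(a: str, b: str) -> float:
    """Toy character-overlap similarity (stand-in for a real text-similarity algorithm)."""
    return len(set(a) & set(b)) / max(len(set(a) | set(b)), 1)

def match_by_similarity_gap(temp_result: str, corpus_set: List[str],
                            preset_difference: float = 0.2) -> Optional[str]:
    """Return the matched corpus, or None to wait for the next temporary result."""
    if not corpus_set:
        return None
    if len(corpus_set) == 1:
        return corpus_set[0]
    scored = sorted(((similarity(c, temp_result), c) for c in corpus_set), reverse=True)
    (top_score, top_corpus), (second_score, _) = scored[0], scored[1]
    if top_score - second_score > preset_difference:
        return top_corpus   # e.g. "introduce Qingdao" once "Qingdao" has been heard
    return None
```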
In the second way, if the corpora with the highest similarity to the adjacent multiple temporary recognition results in the corpus are the first corpora, and the trend of the similarity between the multiple temporary recognition results and the first corpora is increased first and then decreased, the first corpora is determined to be the corpora matched with the first temporary recognition result, and the first temporary recognition result is the temporary recognition result with the highest value of the similarity to the first corpora in the multiple temporary recognition results.
In a specific implementation, if the conditions of the second mode are not satisfied, step S303 is continuously executed, and the predicted text is determined based on the next temporary recognition result.
For example, it is assumed that the target corpus set "introduce" includes corpora such as "introduce Qingdao" and "introduce Qingdao food". The corpus with the highest similarity to the nth temporary recognition result "introduce Qingdao" is "introduce Qingdao", while the corpus with the highest similarity to the (n+1)th temporary recognition result "introduce Qingdao foo" (the word "food" not yet complete) is "introduce Qingdao food"; because the most similar corpus differs between the two adjacent temporary recognition results, the predicted text is determined based on the (n+2)th temporary recognition result. The corpus with the highest similarity to the (n+2)th temporary recognition result "introduce Qingdao food" is "introduce Qingdao food", and so far the similarity between the successive temporary recognition results and the corpus "introduce Qingdao food" keeps increasing, so the predicted text is determined based on the (n+3)th temporary recognition result. The corpus with the highest similarity to the (n+3)th temporary recognition result "introduce Qingdao food tod" (the start of a further word) is still "introduce Qingdao food", but its similarity to this corpus is lower than that of the (n+2)th temporary recognition result "introduce Qingdao food"; that is, the similarity between the successive temporary recognition results and the corpus "introduce Qingdao food" first increases and then decreases. Because the (n+2)th temporary recognition result "introduce Qingdao food" has the highest similarity, the corpus "introduce Qingdao food" can be determined as the corpus matching the (n+2)th temporary recognition result "introduce Qingdao food".
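A minimal sketch of the second mode is given below, assuming that a (temporary result, best corpus, similarity) triple has already been computed for each adjacent temporary recognition result whose best-matching corpus is the same; the helper name and the example similarity values are hypothetical:

```python
def match_by_trend(history: list):
    # history: (temporary_result, best_corpus, similarity) triples for adjacent
    # temporary recognition results, in time order.
    if len(history) < 2:
        return None
    if len({best for _, best, _ in history}) != 1:
        return None  # the best-matching corpus changed, so keep waiting
    sims = [s for _, _, s in history]
    rose = all(a <= b for a, b in zip(sims[:-2], sims[1:-1]))
    if rose and sims[-1] < sims[-2]:
        peak = len(history) - 2  # similarity rose and has now started to fall
        return history[peak][0], history[peak][1]
    return None  # trend is not yet "rise then fall": wait for the next result

# For the Qingdao example above, once the (n+3)th result arrives the history is
# [("introduce Qingdao foo", "introduce Qingdao food", 0.8),
#  ("introduce Qingdao food", "introduce Qingdao food", 1.0),
#  ("introduce Qingdao food tod", "introduce Qingdao food", 0.9)]
# and match_by_trend returns the (n+2)th result together with "introduce Qingdao food".
```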
And in a third mode, if the similarity value between the temporary recognition result and a certain corpus in the target corpus set exceeds a second threshold value, determining the corpus as the corpus matched with the temporary recognition result.
Further, for the current temporary recognition result, when a plurality of corpora having similarity values with the current temporary recognition result exceeding a second threshold exist in the target corpus set, the corpus having the highest similarity with the current temporary recognition result may be selected from the plurality of corpora, and determined as the corpus matched with the current temporary recognition result.
Further, if the similarity between the current temporary recognition result and all the corpora in the target corpus set does not exceed the second threshold, step S303 is continuously executed to determine the predicted text based on the next temporary recognition result.
In the embodiment of the present invention, the second threshold may be determined by those skilled in the art according to the accuracy of the selected similarity algorithm and the actual requirement, and is not particularly limited.
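The third mode can likewise be sketched in a few lines, reusing the similarity helper from the earlier sketch; the second threshold value of 0.9 is only an assumed example:

```python
def match_by_threshold(temp_result: str, target_corpus_set: list, second_threshold: float = 0.9):
    # Third mode: any corpus whose similarity exceeds the second threshold matches;
    # if several qualify, keep only the single most similar one.
    scored = [(similarity(temp_result, c), c) for c in target_corpus_set]
    hits = [sc for sc in scored if sc[0] > second_threshold]
    return max(hits)[1] if hits else None  # None: wait for the next temporary result
```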
On the basis of any of the above embodiments, the corpus matching the temporary recognition result may also be determined from the target corpus set in the following manner: after step S302 is executed, if any corpus in the target corpus set contains any temporary recognition result, that corpus is determined as the corpus matching that temporary recognition result.
In specific implementation, if only one corpus in the target corpus set contains the temporary recognition result, that corpus is determined as the predicted text; if a plurality of corpora in the target corpus set contain the temporary recognition result, step S303 is continuously executed, and the predicted text is determined based on the next temporary recognition result.
For example, the temporary recognition result is "introduce Beijing", and it is assumed that the target corpus set "introduce" includes the corpora "introduce Beijing", "introduce Beijing's food" and "introduce Beijing's scenic spots". All three corpora contain the temporary recognition result "introduce Beijing", so the predicted text is determined based on the next temporary recognition result. If the next temporary recognition result is "introduce Beijing's food", only the corpus "introduce Beijing's food" in the target corpus set contains it, and the corpus "introduce Beijing's food" is determined as the predicted text.
Further, if no corpus in the target corpus set contains the temporary recognition result, which indicates that a corpus matching the temporary recognition result cannot be determined from the target corpus set, the process returns to step S302 and the target corpus set is re-determined.
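The containment-based matching described above has three possible outcomes (a unique hit, several hits, or no hit), which the following sketch makes explicit; the return-value convention is an assumption of this sketch, not of the embodiment:

```python
def match_by_containment(temp_result: str, target_corpus_set: list):
    # Containment-based matching: the outcome depends on how many corpora in the
    # target corpus set contain the temporary recognition result.
    hits = [c for c in target_corpus_set if temp_result in c]
    if len(hits) == 1:
        return "matched", hits[0]     # unique hit: use it as the predicted text
    if len(hits) > 1:
        return "wait", None           # several hits: wait for the next temporary result
    return "redetermine", None        # no hit: re-determine the target corpus set (step S302)
```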
Based on any of the above embodiments, after the predicted text is determined, the corresponding response data can be determined according to the predicted text and the intelligent device is controlled to output the response data; meanwhile, the next predicted text continues to be determined based on subsequently obtained temporary recognition results. In this way, sentence-breaking processing of the continuously input audio stream data is realized, so that multiple consecutive sentences contained in the audio stream data are effectively distinguished, the predicted text corresponding to each sentence in the audio stream data input by the user can be predicted in real time, and the intelligent device is controlled to respond in time according to the predicted text, which shortens the response time of the intelligent device and improves the user experience.
The response data in the embodiment of the present invention includes, but is not limited to, text data, audio data, image data, video data, a voice broadcast, a control instruction and the like, where the control instructions include, but are not limited to: instructions for controlling the intelligent device to display expressions, instructions for controlling the motion of action components of the intelligent device (such as leading the way, navigating, photographing, dancing and the like), and the like.
On the basis of any of the above embodiments, if an invalid text is determined, matching the valid text excluding the invalid text in the temporary recognition result with any corpus in the target corpus set, and determining the matched corpus as the predicted text of the temporary recognition result.
For example, if part of the temporary recognition result, such as a fragment like "blue-and-white porcelain" that does not belong to the current query, is determined to be invalid text, the valid text remaining after the invalid text is excluded is matched with the corpora in the target corpus set, and the matched corpus is determined as the predicted text of the temporary recognition result; the matching process may refer to the foregoing embodiments and is not described again.
On the basis of any one of the above embodiments, the embodiment of the present invention further includes the steps of: and adding a truncation identifier after the recognized text contained in the current temporary recognition result, wherein the recognized text is the temporary recognition result corresponding to the predicted text.
On this basis, when text prediction is subsequently performed according to the temporary recognition results, the corresponding corpus set is determined according to the text after the truncation identifier in at least one temporary recognition result, and if the text after the truncation identifier in any temporary recognition result matches any corpus in the determined corpus set, the matched corpus is determined as the predicted text of the text after the truncation identifier in that temporary recognition result.
For example, the server receives audio stream data uploaded by the smart device: "How is the weather today? Is it suitable for going out for an outing?". After the predicted text corresponding to the temporary recognition result "how is the weather today" is determined, the temporary recognition result "how is the weather today" is determined as the recognized text. When the current temporary recognition result is "how is the weather today is it suitable", a truncation identifier "/" is added after the recognized text "how is the weather today" contained in the current temporary recognition result, obtaining "how is the weather today/is it suitable", where the text before the truncation identifier "/" is a sentence with complete semantics. When text prediction is subsequently performed, only the text after the truncation identifier in the temporary recognition result is processed; for example, when the temporary recognition result is "how is the weather today/is it suitable for going out for an outing", prediction is performed according to the text after the truncation identifier, "is it suitable for going out for an outing", and the predicted text is "is it suitable for going out for an outing".
Through the embodiment, mutual interference among a plurality of continuous sentences in the audio stream data can be prevented in the text prediction and matching process.
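A small sketch of the truncation handling follows; the "/" character and the helper names are illustrative assumptions:

```python
TRUNCATION_MARK = "/"

def add_truncation_mark(temp_result: str, recognized_text: str) -> str:
    # Insert the truncation identifier right after the already-recognized text,
    # i.e. the part of the temporary result for which a predicted text was found.
    if recognized_text and temp_result.startswith(recognized_text):
        return recognized_text + TRUNCATION_MARK + temp_result[len(recognized_text):]
    return temp_result

def text_for_prediction(marked_result: str) -> str:
    # Only the text after the last truncation identifier takes part in later
    # prediction, so a finished sentence no longer interferes with the next one.
    return marked_result.rsplit(TRUNCATION_MARK, 1)[-1]

# add_truncation_mark("how is the weather today is it suitable", "how is the weather today")
# returns "how is the weather today/ is it suitable", and text_for_prediction on that
# string returns " is it suitable".
```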
On the basis of any of the above embodiments, before determining the corresponding corpus set according to at least one temporary recognition result for the first time, the method further includes the following steps: and determining that the number of characters contained in the temporary recognition result exceeds a second preset number.
In specific implementation, each time a temporary recognition result is obtained, whether the number of characters included in the temporary recognition result exceeds a second preset number is judged, if the number of characters included in the temporary recognition result exceeds the second preset number, the step S302 is executed, otherwise, the step S302 is not executed, and the next temporary recognition result is waited.
In the embodiment of the present invention, the second preset number may be configured according to actual requirements, and is not specifically limited. For example, the word number of the shortest feature word in the corpus may be determined, and if the word number of the shortest feature word is 3, the value of the second preset number is not greater than 3.
On the basis of any one of the above embodiments, the embodiment of the present invention further includes the steps of: if a voice end point is detected in the audio stream data collected by the intelligent device in real time, emptying the temporary recognition result obtained before, and returning to step S201 (or step S301).
The voice end point in the embodiment of the present invention refers to a time when the user voice in the audio stream data ends. In particular implementations, the voice end point may be detected by VAD techniques.
For example, the audio stream data collected by the smart device and received by the server is: "Is it suitable to go out for an outing today … is the botanical garden open", where a speech end point follows "is it suitable to go out for an outing today". According to the time order of the audio stream data, the following temporary recognition results are obtained in sequence: "today", "today is it" … "is it suitable to go out for an outing today". When the speech end point after "is it suitable to go out for an outing today" is detected, the previously obtained temporary recognition result "is it suitable to go out for an outing today" is cleared, and temporary recognition results continue to be obtained based on the subsequent audio stream data: "botanical", "botanical garden" … .
As another possible implementation manner, step S202 specifically includes: and selecting a corpus candidate matched with the temporary recognition result from the corpus to obtain a corpus set.
To this end, referring to fig. 5, another speech signal processing method according to an embodiment of the present invention includes the following steps:
S501, voice recognition is carried out on audio stream data collected by the intelligent device in real time, and a temporary recognition result is obtained.
S502, selecting a corpus candidate matched with the temporary recognition result from the corpus to obtain a corpus set.
And S503, if any subsequent temporary recognition result is matched with any corpus in the corpus set determined in the step S502, determining the matched corpus as a prediction text of the temporary recognition result.
For convenience of description, the corpus set determined in step S502 is referred to as a corpus candidate set.
In specific implementation, corpora matching the temporary recognition result are selected from the corpus as candidate corpora based on one or more of a keyword retrieval technology, a similarity algorithm, a fuzzy matching algorithm and the like. Specifically, a corpus whose sentence head is the temporary recognition result may be taken as a candidate corpus.
Further, if the number of the candidate corpora does not exceed the first preset number, a set composed of all the candidate corpora can be used as a candidate corpus set; and if the number of the candidate linguistic data exceeds a first preset number, selecting the linguistic data with the first preset number from the candidate linguistic data according to a preset strategy to obtain a candidate linguistic data set.
In the embodiment of the present invention, the first preset number may be preconfigured by a person skilled in the art according to actual needs, and a value of the first preset number is not limited herein.
On this basis, a first preset number of corpora can be selected from the corpus candidates according to the following preset strategy to obtain a corpus candidate set:
in the first mode, if the number of the candidate corpuses matched with the temporary recognition result exceeds a first preset number, the candidate corpuses are sequenced according to the text length of each candidate corpus, and the first preset number of the candidate corpuses with the front sequencing is selected to obtain a candidate corpus set.
And in the second mode, if the number of the candidate corpuses matched with the temporary recognition result exceeds a first preset number, the candidate corpuses are sequenced according to the hit times of the candidate corpuses, and the first preset number of the candidate corpuses with the front sequencing is selected to obtain a candidate corpus set.
In the embodiment of the present invention, the number of times a corpus is hit refers to the number of times the corpus is determined as a predicted text. In the process of using the corpus, the number of times each corpus is hit can be updated in real time or periodically. Specifically, the number of times a corpus is hit may be the number of times it is hit within a specified period, for example one day or one week; when the specified period ends, the previously counted hit numbers are cleared, and the number of times each corpus in the corpus is hit is counted anew.
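A sketch of the candidate selection with the first preset number as a cap is shown below; sorting shorter corpora first and the default cap of 3 are assumptions made only for illustration:

```python
def select_candidate_set(temp_result: str, corpus: list, hit_count: dict,
                         first_preset_number: int = 3, order_by: str = "length"):
    # Candidate corpora are corpora whose sentence head is the temporary recognition
    # result; keep at most first_preset_number of them, ordered either by text length
    # or by how often each corpus has been hit.
    candidates = [c for c in corpus if c.startswith(temp_result)]
    if len(candidates) <= first_preset_number:
        return candidates
    if order_by == "length":
        candidates.sort(key=len)  # assumption: shorter corpora first
    else:
        candidates.sort(key=lambda c: hit_count.get(c, 0), reverse=True)  # most-hit first
    return candidates[:first_preset_number]
```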
On the basis of any of the above embodiments, step S503 specifically includes: if the next temporary recognition result is consistent with any one of the corpora in the corpus set determined in step S502, determining the corpus as a predicted text of the next temporary recognition result.
In specific implementation, whether the next temporary recognition result is consistent with any corpus in the corpus candidate set or not can be determined according to the following modes: and calculating the similarity between the next temporary recognition result and the corpus aiming at any corpus in the candidate corpus set, if the similarity exceeds a similarity threshold, determining that the next temporary recognition result is consistent with the corpus, and if the similarity does not exceed the similarity threshold, determining that the next temporary recognition result is inconsistent with the corpus.
In specific implementation, the similarity between the temporary recognition result and each corpus in the corpus candidate set may be calculated by any existing similarity algorithm, for example, a text similarity algorithm, an euclidean distance, or the like.
In the embodiment of the present invention, the specific value of the similarity threshold may be determined by a person skilled in the art based on the specific requirements of the selected similarity algorithm, such as precision, recognition accuracy, text generalization capability, and the like, in combination with practical experience, and the embodiment of the present invention is not limited.
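The consistency check can be sketched as a single scan over the candidate corpus set, reusing the similarity helper from the earlier sketch; the similarity threshold of 0.95 is only an assumed value:

```python
def consistent_corpus(temp_result: str, candidate_corpus_set: list,
                      similarity_threshold: float = 0.95):
    # A temporary result is "consistent" with a candidate corpus when their similarity
    # exceeds the similarity threshold; return that corpus as the predicted text,
    # or None when the candidate corpus set has to be re-determined.
    for c in candidate_corpus_set:
        if similarity(temp_result, c) > similarity_threshold:
            return c
    return None
```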
Further, the method of the embodiment of the present invention further includes the following steps: and if the next temporary recognition result is not consistent with all the linguistic data in the candidate corpus set, re-determining the candidate corpus set according to the next temporary recognition result.
In specific implementation, the corpus candidate set can be determined again according to the next temporary recognition result in the following way: selecting a candidate corpus matched with a next temporary recognition result from the corpus to obtain a first candidate set; and selecting candidate corpora different from the corpora contained in the previously determined candidate corpus set from the first candidate set, and adding the selected candidate corpora into the candidate corpus set to obtain a new candidate corpus set.
As a possible implementation manner, the corpus candidate may be selected from the corpus to obtain a first candidate set, a first preset number of corpus candidates different from the corpora included in the previously determined corpus candidate set are selected from the first candidate set, and the selected corpus candidate is added to the corpus candidate set to obtain a new corpus candidate set.
For example, the first preset number is 3, the first candidate set includes 10 corpus candidates matched with the next temporary recognition result, 3 corpus not included in the previously determined corpus candidate set are selected from the 10 corpus candidates, and the 3 selected corpus are added to the previously determined corpus candidate set to obtain a new corpus candidate set.
Specifically, the corpus candidates not exceeding the first preset number may be selected from the first candidate set in the following manner:
determining candidate corpora which are different from corpora contained in the previously determined candidate corpus set in the first candidate set; if the number of the determined candidate corpora does not exceed the first preset number, directly adding the determined candidate corpora into the previously determined candidate corpus set; and if the number of the determined candidate linguistic data exceeds a first preset number, selecting the first preset number of candidate linguistic data from the determined candidate linguistic data according to a preset strategy, and adding the selected candidate linguistic data into a candidate linguistic data set to obtain a new candidate linguistic data set. The preset policy may be: and sequencing the determined candidate linguistic data according to the text length or hit frequency of the determined candidate linguistic data, and selecting a first preset number of candidate linguistic data which are sequenced in the front.
As another possible implementation manner, according to a preset strategy, a first preset number of candidate corpora matched with the next temporary recognition result may be selected from the corpus to obtain a first candidate set, candidate corpora different from corpora included in the previously determined candidate corpus set are selected from the first candidate set, and the selected candidate corpora are added to the candidate corpus set to obtain a new candidate corpus set.
Further, if the number of candidate corpora added to the candidate corpus set is smaller than the first preset number, a first preset number of candidate corpora matching the next temporary recognition result continue to be selected from the corpus according to the preset strategy to obtain a new first candidate set, candidate corpora different from the corpora already contained in the previously determined candidate corpus set are selected from the first candidate set, and the selected candidate corpora are added to the candidate corpus set to obtain a new candidate corpus set. If the total number of candidate corpora added to the candidate corpus set in the two selections is still smaller than the first preset number, the above steps are repeated until the total number of candidate corpora added to the candidate corpus set equals the first preset number. Specifically, if, at the last addition to the candidate corpus set, the sum of the number m of candidate corpora selected from the first candidate set and the total number M of candidate corpora previously added to the candidate corpus set is greater than the first preset number K, then only N candidate corpora are selected from the first candidate set according to the preset strategy and added to the candidate corpus set to obtain the new candidate corpus set, where N = K - M, so that the total number of added candidate corpora equals the first preset number.
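The batched refresh with the N = K - M cap can be sketched as follows; the batch size and the use of sentence-head matching are assumptions of the sketch:

```python
def refresh_candidate_set(next_temp_result: str, corpus: list, candidate_corpus_set: list,
                          first_preset_number: int = 3, batch_size: int = 3):
    # Batched refresh: keep drawing candidate corpora that match the next temporary
    # result, skip those already in the candidate corpus set, and stop once K new
    # corpora have been added; the last batch is cut to N = K - M so the additions
    # never exceed the first preset number K.
    K = first_preset_number
    new_set = list(candidate_corpus_set)
    added = 0  # M: corpora added to the candidate corpus set during this refresh
    pool = [c for c in corpus if c.startswith(next_temp_result)]
    for start in range(0, len(pool), batch_size):
        first_candidate_set = pool[start:start + batch_size]
        fresh = [c for c in first_candidate_set if c not in new_set]
        fresh = fresh[:K - added]  # N = K - M
        new_set.extend(fresh)
        added += len(fresh)
        if added == K:
            break
    return new_set
```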
For example, at time T0, a temporary recognition result L0 is obtained, and according to the preset strategy, a first preset number of candidate corpora containing L0 are selected from the corpus to obtain a candidate corpus set Q0.
At time T1, a temporary recognition result L0+L1 is obtained, and whether a corpus consistent with L0+L1 exists in the candidate corpus set Q0 is judged. If such a corpus exists, L0+L1 is a sentence with complete semantics, and the corpus in the candidate corpus set Q0 that is consistent with L0+L1 is taken as the predicted text of the temporary recognition result L0+L1. If no corpus in the candidate corpus set Q0 is consistent with L0+L1, a first preset number of candidate corpora that contain L0+L1 and differ from the corpora in the candidate corpus set Q0 are selected from the corpus according to the preset strategy, and the selected candidate corpora are added to the candidate corpus set Q0 to obtain a new candidate corpus set Q1.
At time T2, a temporary recognition result L0+L1+L2 is obtained, and whether a corpus consistent with L0+L1+L2 exists in the candidate corpus set Q1 is judged. If such a corpus exists, L0+L1+L2 is a sentence with complete semantics, and the corpus in the candidate corpus set Q1 that is consistent with L0+L1+L2 is taken as the predicted text of the temporary recognition result L0+L1+L2. If no corpus in the candidate corpus set Q1 is consistent with L0+L1+L2, a first preset number of candidate corpora that contain L0+L1+L2 and differ from the corpora in the candidate corpus set Q1 are selected from the corpus according to the preset strategy, and the selected candidate corpora are added to the candidate corpus set Q1 to obtain a new candidate corpus set Q2.
For example, the server receives audio stream data sent by the smart device: "introduce Beijing's food", and the first preset number is 3. At time T0, the temporary recognition result is "introduce", and the obtained candidate corpus set Q0 contains the corpora "introduce Beijing", "introduce Beijing's food" and "introduce Tianjin". At time T1, the temporary recognition result is "introduce Beijing"; a corpus consistent with the temporary recognition result "introduce Beijing" exists in the candidate corpus set Q0, which indicates that the temporary recognition result "introduce Beijing" is a sentence with complete semantics, and the corpus "introduce Beijing" in the candidate corpus set Q0 is taken as the predicted text of the temporary recognition result "introduce Beijing". At time T2, the temporary recognition result is "introduce Beijing's" (a still-incomplete utterance), and no corpus in the candidate corpus set Q0 is consistent with this temporary recognition result; therefore, according to the preset strategy, 3 candidate corpora that contain "introduce Beijing's" and differ from the corpora in the candidate corpus set Q0 are selected, for example "introduce Beijing's scenic spots", "introduce Beijing's history" and "introduce Beijing's snacks", and the selected candidate corpora are added to the candidate corpus set Q0 to obtain a new candidate corpus set Q1. At time T3, the temporary recognition result is "introduce Beijing's food"; a corpus consistent with "introduce Beijing's food" exists in the candidate corpus set Q1, which indicates that "introduce Beijing's food" is a sentence with complete semantics, and the corpus in the candidate corpus set Q1 that is consistent with "introduce Beijing's food" is taken as the predicted text of the temporary recognition result "introduce Beijing's food".
In specific implementation, the above steps may be executed in a loop, and the candidate corpus set is continuously updated based on new temporary recognition results; the candidate corpus set is not emptied until a speech end point is detected in the audio stream data. A Voice Activity Detection (VAD) technique may be used to recognize the speech start point and speech end point in the audio stream data: voice recognition is performed on the audio stream data starting from the speech start point, the candidate corpus set is continuously expanded during this process, and at least one predicted text is obtained; when the speech end point is detected, the candidate corpus set is emptied.
Based on the above method, multiple predicted texts can be determined from the continuous audio stream data before the speech end point is detected, and the intelligent device is controlled to make a corresponding response in time according to each predicted text. Therefore, in the method of this embodiment, the corpus set determined in real time during prediction is used to detect and predict, in real time, continuously and quickly, the sentences with complete semantics in the audio stream data, so that multiple consecutive sentences contained in the audio stream data are effectively distinguished. The predicted text can thus be determined without waiting for the speech end point, the user's intention can be understood quickly, and the intelligent device is controlled to make a corresponding response, making the voice interaction more immediate, natural and smooth.
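Tying the pieces together, the whole second embodiment can be sketched as a small driver loop over the stream of temporary recognition results; it reuses the hypothetical helpers from the earlier sketches and, for brevity, leaves out VAD handling and the truncation identifier:

```python
def streaming_prediction(temporary_results, corpus: list, first_preset_number: int = 3):
    # temporary_results: temporary recognition results in time order, between one
    # speech start point and the following speech end point.
    candidate_corpus_set: list = []
    for temp in temporary_results:
        if not candidate_corpus_set:
            candidate_corpus_set = select_candidate_set(
                temp, corpus, hit_count={}, first_preset_number=first_preset_number)
            continue
        predicted = consistent_corpus(temp, candidate_corpus_set)
        if predicted is not None:
            yield temp, predicted  # respond at once, without waiting for the speech end point
        else:
            candidate_corpus_set = refresh_candidate_set(
                temp, corpus, candidate_corpus_set, first_preset_number)
```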
On the basis of any one of the above embodiments, the embodiment of the present invention further includes the steps of: and adding a truncation identifier after the recognized text contained in the current temporary recognition result, wherein the recognized text is the temporary recognition result corresponding to the predicted text.
On this basis, when text prediction is subsequently performed according to the temporary recognition results: the corresponding corpus set is determined according to the text after the truncation identifier in at least one temporary recognition result; and if the text after the truncation identifier in any temporary recognition result matches any corpus in the determined corpus set, the matched corpus is determined as the predicted text of the text after the truncation identifier in that temporary recognition result.
For example, at time T0, a temporary recognition result L0 is obtained, and according to the preset strategy, a first preset number of candidate corpora containing L0 are selected from the corpus to obtain a candidate corpus set Q0.
At time T1, a temporary recognition result L0+L1 is obtained, and whether a corpus consistent with L0+L1 exists in the candidate corpus set Q0 is judged. If such a corpus exists, L0+L1 is a sentence with complete semantics; the corpus in the candidate corpus set Q0 that is consistent with L0+L1 is taken as the predicted text of the temporary recognition result L0+L1, and the recognized text is L0+L1.
At time T2, a temporary recognition result L0+L1+L2 is obtained; a truncation identifier is added after the recognized text L0+L1 in the temporary recognition result, and text prediction is performed based on the text L2 after the truncation identifier, namely: according to the preset strategy, a first preset number of candidate corpora that contain L2 and differ from the corpora in the candidate corpus set Q0 are selected from the corpus, and the selected candidate corpora are added to the candidate corpus set Q0 to obtain a new candidate corpus set Q1.
At time T3, a temporary recognition result L0+L1+L2+L3 is obtained, in which a truncation identifier follows L0+L1. Whether a corpus consistent with L2+L3 exists in the candidate corpus set Q1 is judged; if such a corpus exists, L2+L3 is a sentence with complete semantics, and the corpus in the candidate corpus set Q1 that is consistent with L2+L3 is taken as the predicted text of the temporary recognition result L2+L3. If no corpus in the candidate corpus set Q1 is consistent with L2+L3, a first preset number of candidate corpora that contain L2+L3 and differ from the corpora in the candidate corpus set Q1 are selected from the corpus according to the preset strategy, and the selected candidate corpora are added to the candidate corpus set Q1 to obtain a new candidate corpus set Q2.
For example, the server receives audio stream data sent by the smart device: "How is the weather today? Is it suitable for an outing?". At time T0, the temporary recognition result is "today's weather", and the obtained candidate corpus set Q0 contains the corpus "how is the weather today". At time T1, the temporary recognition result is "how is the weather today"; a corpus consistent with the temporary recognition result "how is the weather today" exists in the candidate corpus set Q0, which indicates that the temporary recognition result "how is the weather today" is a sentence with complete semantics; the corpus "how is the weather today" in the candidate corpus set Q0 is taken as the predicted text of the temporary recognition result "how is the weather today", and the recognized text is "how is the weather today". At time T2, the temporary recognition result is "how is the weather today is it suitable"; a truncation identifier "/" is added after the recognized text "how is the weather today" in the temporary recognition result, obtaining "how is the weather today/is it suitable", and text prediction is performed based on the text "is it suitable" after the truncation identifier, namely: according to the preset strategy, a first preset number of candidate corpora containing "is it suitable" are selected from the corpus to obtain a candidate corpus set Q2. At time T3, the temporary recognition result "how is the weather today/is it suitable for an outing" is obtained, and whether a corpus consistent with "is it suitable for an outing" exists in the candidate corpus set Q2 is judged; if such a corpus exists, it indicates that "is it suitable for an outing" is a sentence with complete semantics, and the corpus in the candidate corpus set Q2 that is consistent with "is it suitable for an outing" is taken as the predicted text of the temporary recognition result "is it suitable for an outing".
On the basis of any of the above embodiments, before determining the corresponding corpus set according to at least one temporary recognition result for the first time, the method further includes the following steps: and determining that the number of characters contained in the temporary recognition result exceeds a second preset number.
In specific implementation, each time a temporary recognition result is obtained, whether the number of characters included in the temporary recognition result exceeds a second preset number is judged, if the number of characters included in the temporary recognition result exceeds the second preset number, the step S502 is executed, otherwise, the step S502 is not executed, and the next temporary recognition result is waited.
In the embodiment of the present invention, the second preset number may be configured according to actual requirements, and is not specifically limited.
On the basis of any one of the above embodiments, the embodiment of the present invention further includes the steps of: if a voice end point is detected in the audio stream data collected by the intelligent device in real time, the temporary recognition result obtained before the voice end point is cleared, and the process returns to step S501.
As shown in fig. 6, based on the same inventive concept as the above-mentioned speech signal processing method, an embodiment of the present invention further provides a speech signal processing apparatus 60, including: a speech recognition module 601, a determination module 602 and a prediction module 603.
The voice recognition module 601 is configured to perform voice recognition on audio stream data acquired by the intelligent device in real time to obtain a temporary recognition result;
a determining module 602, configured to determine a corresponding corpus set according to at least one temporary recognition result, where the corpus set includes at least one corpus;
the predicting module 603 is configured to, if any subsequent temporary recognition result matches any corpus in the corpus set, determine the matched corpus as a predicted text of the temporary recognition result.
Optionally, the determining module 602 is specifically configured to:
and selecting a corpus candidate matched with the temporary recognition result from the corpus to obtain a corpus set.
Optionally, the determining module 602 is specifically configured to:
if the number of the candidate corpuses matched with the temporary recognition result exceeds a first preset number, the candidate corpuses are sequenced according to the text length of each candidate corpus, and a first preset number of candidate corpuses with the front sequencing are selected to obtain a corpus set; or if the number of the candidate corpuses matched with the temporary recognition result exceeds a first preset number, the candidate corpuses are sorted according to the hit times of the candidate corpuses, and the first preset number of the candidate corpuses with the front sorting is selected to obtain a corpus set.
Optionally, the prediction module 603 is specifically configured to:
and if the next temporary recognition result is consistent with any one corpus in the corpus set, determining the corpus as a predicted text of the next temporary recognition result.
Optionally, the determining module 602 is further configured to:
and if the next temporary recognition result is not consistent with all the linguistic data in the linguistic data set, re-determining the linguistic data set according to the next temporary recognition result.
Optionally, the determining module 602 is specifically configured to:
selecting a candidate corpus matched with the next temporary recognition result from the corpus to obtain a first candidate set;
and selecting candidate corpora different from the corpora contained in the previously determined corpus set from the first candidate set, and adding the selected candidate corpora to the candidate corpus set.
Optionally, the determining module 602 is specifically configured to:
determining the corresponding corpus set according to the feature words corresponding to the corpus sets contained in the corpus and at least one temporary recognition result, where corpora containing the same feature word are divided into the same corpus set.
Optionally, the determining module 602 is specifically configured to:
if the feature word corresponding to any corpus set is consistent with at least part of the text contained in the temporary recognition result, determining the corpus set as the corpus set corresponding to the temporary recognition result; or if the similarity between the feature word corresponding to any corpus set and the temporary recognition result is higher than a first threshold, determining the corpus set as the corpus set corresponding to the temporary recognition result.
Optionally, the determining module 602 is specifically configured to:
determining an invalid text in the temporary recognition result according to the matching result of the at least one temporary recognition result and the feature words;
and determining the corresponding corpus sets according to the characteristic words corresponding to the corpus sets and the effective texts except the invalid texts in at least one temporary recognition result.
Optionally, the determining module 602 is specifically configured to:
if the first feature word matched with the current temporary recognition result is different from the second feature word matched with the last temporary recognition result, and the similarity between the first feature word and the current temporary recognition result is higher than the similarity between the second feature word and the last temporary recognition result, determining that the last temporary recognition result is an invalid text; or if the temporary recognition result contains a preset high-frequency word and the similarity corresponding to the first characteristic word matched with the temporary recognition result is higher than the similarity corresponding to the second characteristic word matched with the temporary recognition result at the last time, determining that the text before the high-frequency word contained in the temporary recognition result is an invalid text.
Optionally, the predicting module 603 is specifically configured to determine that any subsequent temporary recognition result matches any corpus in the corpus set according to the following manner:
determining the corpus matched with any one-time temporary recognition result according to the similarity between any one-time temporary recognition result and any one corpus in the corpus set; or if any corpus in the corpus set contains any next temporary recognition result, determining that the any corpus is the corpus matched with the any temporary recognition result.
Optionally, the prediction module 603 is specifically configured to:
if the difference between the similarity of the first corpus and the similarity of the second corpus is larger than a preset difference, determining that the first corpus is the corpus matching the temporary recognition result, where the first corpus is the corpus in the corpus set with the highest similarity to the temporary recognition result, and the second corpus is the corpus in the corpus set with the second highest similarity to the temporary recognition result; or, if the corpus in the corpus set with the highest similarity to each of multiple adjacent temporary recognition results is the same first corpus, and the similarity between the multiple temporary recognition results and the first corpus first increases and then decreases, determining that the first corpus is the corpus matching the first temporary recognition result, where the first temporary recognition result is the temporary recognition result, among the multiple temporary recognition results, with the highest similarity to the first corpus.
Optionally, the determining module 602 is further configured to:
adding a truncation identifier after the recognized text included in the temporary recognition result, where the recognized text is the temporary recognition result corresponding to the predicted text;
and determining the corresponding corpus set according to the text after the truncation identifier in at least one temporary recognition result.
Optionally, the determining module 602 is further configured to:
and determining that the number of characters contained in the temporary recognition result exceeds a second preset number before determining the corresponding corpus set according to at least one temporary recognition result for the first time.
Optionally, the apparatus further comprises an emptying module, configured to:
if a voice end point is detected in the audio stream data collected by the intelligent device in real time, emptying the obtained temporary recognition result, and returning to execute the function of the voice recognition module 601.
The voice signal processing device and the voice signal processing method provided by the embodiment of the invention adopt the same inventive concept, can obtain the same beneficial effects, and are not described again.
Based on the same inventive concept as the voice signal processing method, an embodiment of the present invention further provides an electronic device, which may specifically be a control device or a control system inside an intelligent device, or an external device communicating with the intelligent device, such as a desktop computer, a portable computer, a smart phone, a tablet computer, a Personal Digital Assistant (PDA), a server, and the like. As shown in fig. 7, the electronic device 70 may include a processor 701 and a memory 702.
Memory 702 may include Read Only Memory (ROM) and Random Access Memory (RAM), and provides the processor with program instructions and data stored in the memory. In an embodiment of the present invention, the memory may be used to store a program of a voice signal processing method.
The processor 701 may be a CPU (central processing unit), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or a CPLD (Complex Programmable Logic Device), and implements the voice signal processing method in any of the above embodiments according to an obtained program instruction by calling a program instruction stored in a memory.
An embodiment of the present invention provides a computer-readable storage medium for storing computer program instructions for the electronic device, which includes a program for executing the voice signal processing method.
The computer storage media may be any available media or data storage device that can be accessed by a computer, including but not limited to magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), Solid State Disks (SSDs)), etc.
Based on the same inventive concept as the speech signal processing method, an embodiment of the present invention provides a computer program product comprising a computer program stored on a computer-readable storage medium, the computer program comprising program instructions that, when executed by a processor, implement the speech signal processing method in any of the above embodiments.
The above embodiments are only used to describe the technical solutions of the present application in detail and to help understand the method of the embodiments of the present invention, and should not be construed as limiting the embodiments of the present invention. Variations or substitutions readily apparent to those skilled in the art are intended to fall within the scope of the embodiments of the present invention.

Claims (10)

1. A speech signal processing method, comprising:
carrying out voice recognition on audio stream data acquired by intelligent equipment in real time to obtain a temporary recognition result;
determining a corresponding corpus set according to at least one temporary recognition result, wherein the corpus set comprises at least one corpus;
and if any temporary recognition result is matched with any corpus in the corpus set, determining the matched corpus as a predicted text of the temporary recognition result.
2. The method according to claim 1, wherein the determining a corresponding corpus set according to at least one temporary recognition result specifically includes:
and selecting a corpus candidate matched with the temporary recognition result from the corpus to obtain a corpus set.
3. The method according to claim 2, wherein the selecting the corpus from the corpus that matches the temporary recognition result to obtain a corpus set specifically includes:
if the number of the candidate corpuses matched with the temporary recognition result exceeds a first preset number, the candidate corpuses are sequenced according to the text length of each candidate corpus, and a first preset number of candidate corpuses with the front sequencing are selected to obtain a corpus set; or
And if the number of the candidate corpuses matched with the temporary recognition result exceeds a first preset number, sorting the candidate corpuses according to the number of times of hitting of each candidate corpus, and selecting the first preset number of candidate corpuses with the front sorting to obtain a corpus set.
4. The method according to any one of claims 1 to 3, wherein if any one of the temporary recognition results is matched with any one of the corpora in the corpus set, determining the matched corpora as the predicted text of the temporary recognition result, specifically comprising:
and if the next temporary recognition result is consistent with any one corpus in the corpus set, determining the corpus as a predicted text of the next temporary recognition result.
5. The method according to claim 1, wherein the determining a corresponding corpus set according to at least one temporary recognition result specifically includes:
determining the corresponding corpus set according to the feature words corresponding to the corpus sets contained in the corpus and at least one temporary recognition result, wherein corpora containing the same feature word are divided into the same corpus set.
6. The method according to claim 5, wherein determining a corpus set according to the feature words and at least one temporary recognition result corresponding to each corpus set included in the corpus specifically includes:
if the characteristic words corresponding to any corpus set are consistent with at least part of texts contained in the temporary recognition result, determining the corpus set as a corpus set corresponding to the temporary recognition result; or
And if the similarity between the feature words corresponding to any one corpus set and the temporary recognition result is higher than a first threshold value, determining the corpus set as the corpus set corresponding to the temporary recognition result.
7. The method according to claim 5 or 6, wherein any subsequent temporary recognition result is determined to match any corpus in the corpus set according to the following method:
determining the corpus matched with any one-time temporary recognition result according to the similarity between any one-time temporary recognition result and any one corpus in the corpus set; or
And if any corpus in the corpus set contains any subsequent temporary recognition result, determining that the any corpus is a corpus matched with the temporary recognition result at any time.
8. A speech signal processing apparatus, comprising:
the voice recognition module is used for carrying out voice recognition on audio stream data acquired by the intelligent equipment in real time to obtain a temporary recognition result;
the determining module is used for determining a corresponding corpus set according to at least one temporary recognition result, wherein the corpus set comprises at least one corpus;
and the prediction module is used for determining the matched linguistic data as the prediction text of the temporary recognition result if any subsequent temporary recognition result is matched with any linguistic data in the linguistic data set.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, implement the steps of the method of any one of claims 1 to 7.
CN201910606001.3A 2019-07-05 2019-07-05 Voice signal processing method and device, electronic equipment and storage medium Pending CN112185351A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910606001.3A CN112185351A (en) 2019-07-05 2019-07-05 Voice signal processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910606001.3A CN112185351A (en) 2019-07-05 2019-07-05 Voice signal processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112185351A true CN112185351A (en) 2021-01-05

Family

ID=73918745

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910606001.3A Pending CN112185351A (en) 2019-07-05 2019-07-05 Voice signal processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112185351A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1705016A (en) * 2004-05-31 2005-12-07 国际商业机器公司 Device and method for text-to-speech conversion and corpus adjustment
CN103186523A (en) * 2011-12-30 2013-07-03 富泰华工业(深圳)有限公司 Electronic device and natural language analyzing method thereof
CN103956162A (en) * 2014-04-04 2014-07-30 上海元趣信息技术有限公司 Voice recognition method and device oriented towards child
CN104978375A (en) * 2014-09-11 2015-10-14 腾讯科技(深圳)有限公司 Corpus filtering method and device
CN105227790A (en) * 2015-09-24 2016-01-06 北京车音网科技有限公司 A kind of voice answer method, electronic equipment and system
CN105608130A (en) * 2015-12-16 2016-05-25 小米科技有限责任公司 Method and device for obtaining sentiment word knowledge base as well as terminal
CN106782509A (en) * 2016-12-02 2017-05-31 乐视控股(北京)有限公司 A kind of corpus labeling method and device and terminal
CN108010526A (en) * 2017-12-08 2018-05-08 北京奇虎科技有限公司 Method of speech processing and device
CN108509406A (en) * 2017-02-24 2018-09-07 北京搜狗科技发展有限公司 A kind of language material abstracting method, device and electronic equipment
CN108831476A (en) * 2018-05-31 2018-11-16 平安科技(深圳)有限公司 Voice acquisition method, device, computer equipment and storage medium
CN108920622A (en) * 2018-06-29 2018-11-30 北京奇艺世纪科技有限公司 A kind of training method of intention assessment, training device and identification device
CN109036424A (en) * 2018-08-30 2018-12-18 出门问问信息科技有限公司 Audio recognition method, device, electronic equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN110473531B (en) Voice recognition method, device, electronic equipment, system and storage medium
US11848008B2 (en) Artificial intelligence-based wakeup word detection method and apparatus, device, and medium
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
CN108899013B (en) Voice search method and device and voice recognition system
CN109754809B (en) Voice recognition method and device, electronic equipment and storage medium
CN108735201B (en) Continuous speech recognition method, device, equipment and storage medium
CN104157285B (en) Audio recognition method, device and electronic equipment
JP7170920B2 (en) Systems and Methods for End-to-End Speech Recognition Using Trigger Door Tension
JP4902617B2 (en) Speech recognition system, speech recognition method, speech recognition client, and program
JP2023504219A (en) Systems and methods for streaming end-to-end speech recognition with an asynchronous decoder
CN110287303B (en) Man-machine conversation processing method, device, electronic equipment and storage medium
US10152298B1 (en) Confidence estimation based on frequency
US11532301B1 (en) Natural language processing
CN111402894A (en) Voice recognition method and electronic equipment
CN116250038A (en) Transducer of converter: unified streaming and non-streaming speech recognition model
CN108055617A (en) A kind of awakening method of microphone, device, terminal device and storage medium
CN111435592B (en) Voice recognition method and device and terminal equipment
CN110164416B (en) Voice recognition method and device, equipment and storage medium thereof
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
CN112530417B (en) Voice signal processing method and device, electronic equipment and storage medium
JP2002215187A (en) Speech recognition method and device for the same
US11295732B2 (en) Dynamic interpolation for hybrid language models
US11626107B1 (en) Natural language processing
CN114360510A (en) Voice recognition method and related device
CN111508497A (en) Voice recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination