Disclosure of Invention
The invention aims to provide a voice recognition method based on cloud computing and edge computing, which can adapt to an environment with a poor network, improve the accuracy of voice recognition, and improve user experience.
Another objective of the present invention is to provide an electronic device, which is capable of adapting to an environment with a poor network, so as to improve accuracy of speech recognition and improve user experience.
Another object of the present invention is to provide a computer-readable storage medium, which can adapt to an environment with a poor network, improve accuracy of speech recognition, and improve user experience.
The embodiment of the invention is realized by the following steps:
in a first aspect, an embodiment of the present application provides a speech recognition method based on cloud computing and edge computing, including the following steps:
(1) collecting voice information;
(2) processing and calculating the voice information through a cloud end, and outputting a first voice recognition result and a first recognition confidence coefficient T1 based on a cloud end voice database;
(3) locally processing and calculating the voice information, and outputting a second voice recognition result and a second recognition confidence coefficient T2 based on a local edge-end voice database;
(4) comparing the first voice recognition result with the second voice recognition result through a voice scene recognition model algorithm, and outputting a first cloud scene result S1 and a first local scene result S2;
(5) comparing the first recognition confidence T1 and the second recognition confidence T2 through a speech scene recognition model algorithm, and outputting a first recognition result confidence T1 and a second recognition result confidence T2;
(6) judging whether S1 is equal to S2; when S1 is equal to S2, judging whether T1 × T1 > T2 × T2 holds, outputting the first cloud scene result S1 when it holds, and otherwise executing cloud scene recognition and outputting a second cloud scene result; when S1 is not equal to S2, judging whether T1 × T1 < T2 × T2 holds, outputting the first local scene result S2 when it holds, and otherwise executing local scene recognition and outputting a second local scene result.
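The arbitration performed in step (6) can be sketched as follows. This is a minimal Python sketch, not the claimed implementation: the function and argument names are illustrative, and the squared-confidence comparison (T1 × T1 > T2 × T2) follows the comparison stated in the steps above.

```python
def arbitrate(s1, s2, t1, t2, run_cloud_recognition, run_local_recognition):
    """Sketch of the step (6) decision between cloud and local results.

    s1, s2: first cloud / first local scene results.
    t1, t2: first / second recognition confidences.
    run_cloud_recognition / run_local_recognition: callbacks producing the
    second cloud / second local scene results (hypothetical names).
    """
    if s1 == s2:
        # Scene results agree: keep the cloud result only if the cloud
        # confidence dominates, otherwise re-run cloud scene recognition.
        if t1 * t1 > t2 * t2:
            return s1                      # first cloud scene result
        return run_cloud_recognition()     # second cloud scene result
    # Scene results disagree: prefer the local result when the local
    # confidence dominates, otherwise re-run local scene recognition.
    if t1 * t1 < t2 * t2:
        return s2                          # first local scene result
    return run_local_recognition()         # second local scene result
```

For example, with agreeing scenes and a higher cloud confidence the first cloud scene result is returned directly; with disagreeing scenes and a higher local confidence the first local scene result is returned.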
In some embodiments of the present invention, the step (1) includes uploading the collected voice information to a cloud and a local area after noise elimination and human voice detection.
In some embodiments of the present invention, in the step (1), it is determined whether a network exists, and when the network exists, the voice information is uploaded to the cloud and the local, respectively, and when the network does not exist, the voice information is uploaded to the local.
In some embodiments of the present invention, in the step (5), it is determined whether the second recognition confidence T2 is much greater than the first recognition confidence T1, and if so, local scene recognition is performed and a second local scene result is output; otherwise, step (6) is performed.
In some embodiments of the present invention, in the step (2), the processing and calculating the voice message through the cloud includes recognizing the voice message through a cloud voice database; in the step (3), the processing and the calculation of the voice information locally comprise recognizing the voice information through a local voice database.
In some embodiments of the present invention, in the step (4), the first cloud scene result S1 is obtained by performing intent scene positioning on the first voice recognition result, and the first local scene result S2 is obtained by performing intent scene positioning on the second voice recognition result.
In some embodiments of the present invention, in the step (4), when the first local scene result S2 is empty, cloud scene recognition is performed and the second cloud scene result is output; otherwise, it is determined whether T1 × T1 > T2 × T2 holds, the first cloud scene result S1 is output when it holds, and otherwise local scene recognition is performed and the second local scene result is output.
In some embodiments of the present invention, the above speech recognition method based on cloud computing and edge computing further includes the step (7): and feeding back and outputting voice interaction information according to the output result.
In a second aspect, an embodiment of the present application provides an electronic device, including: a memory for storing one or more programs; a processor; the one or more programs, when executed by the processor, implement the method of any of the above first aspects.
In a third aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method according to any one of the first aspect.
Compared with the prior art, the embodiment of the invention has at least the following advantages or beneficial effects:
in a first aspect, an embodiment of the present application provides a speech recognition method based on cloud computing and edge computing, including the following steps:
(1) collecting voice information;
(2) processing and calculating the voice information through a cloud end, and outputting a first voice recognition result and a first recognition confidence coefficient T1 based on a cloud end voice database;
(3) locally processing and calculating the voice information, and outputting a second voice recognition result and a second recognition confidence coefficient T2 based on a local edge-end voice database;
(4) comparing the first voice recognition result with the second voice recognition result through a voice scene recognition model algorithm, and outputting a first cloud scene result S1 and a first local scene result S2;
(5) comparing the first recognition confidence T1 and the second recognition confidence T2 through a speech scene recognition model algorithm, and outputting a first recognition result confidence T1 and a second recognition result confidence T2;
(6) judging whether S1 is equal to S2; when S1 is equal to S2, judging whether T1 × T1 > T2 × T2 holds, outputting the first cloud scene result S1 when it holds, and otherwise executing cloud scene recognition and outputting a second cloud scene result; when S1 is not equal to S2, judging whether T1 × T1 < T2 × T2 holds, outputting the first local scene result S2 when it holds, and otherwise executing local scene recognition and outputting a second local scene result.
With respect to the first aspect: according to the embodiment of the application, the voice information is collected and processed both at the cloud and locally, so that voice recognition remains usable when the network is unstable, improving the user's voice interaction experience. A first voice recognition result and a first recognition confidence are output based on the cloud voice database, and a second voice recognition result and a second recognition confidence are output based on the local voice database; the voice databases recognize the content of the voice information, while the confidence reflects the accuracy of the current recognition. The first and second voice recognition results are compared through the voice scene recognition model algorithm; after the cloud scene result and the local scene result are output, their accuracy is judged according to the logical relations above, and the cloud or the local scene is selected accordingly to execute the corresponding result, which improves the accuracy of voice recognition. By comparing the recognition results and confidences obtained from the cloud and locally, the problem of incomplete information caused by switching between the cloud and the local can be alleviated in weak-network environments such as the subway.
According to the method, the voice recognition results of the cloud scene and the local scene are compared by the voice scene recognition model algorithm, so that user habits can be exploited to improve the accuracy of voice recognition; the multiple recognition results output by the cloud and the local for different voice information can also be used to upgrade the voice databases, so that voice is recognized more accurately according to the user's habits. The invention can adapt to a weak-network environment and improve the accuracy of voice recognition, thereby improving the user's experience of voice interaction.
In a second aspect, an embodiment of the present application provides an electronic device, including: a memory for storing one or more programs; a processor; the one or more programs, when executed by the processor, implement the method of any of the above first aspects.
In a third aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method according to any one of the first aspect.
With respect to the second to third aspects: the principle of the embodiments of the present application is the same as that of the first aspect, and a repeated description thereof is not necessary.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description of the present application, it should also be noted that, unless otherwise explicitly specified or limited, the terms "disposed" and "connected" are to be interpreted broadly: for example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or an internal communication between two elements. The specific meaning of the above terms in the present application can be understood in a specific case by those of ordinary skill in the art.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the individual features of the embodiments can be combined with one another without conflict.
Example 1
Referring to fig. 1 to 2, fig. 1 to 2 are schematic flow charts illustrating a speech recognition method based on cloud computing and edge computing according to an embodiment of the present application. The voice recognition method based on cloud computing and edge computing comprises the following steps:
(1) collecting voice information;
(2) processing and calculating the voice information through a cloud end, and outputting a first voice recognition result and a first recognition confidence coefficient T1 based on a cloud end voice database;
(3) locally processing and calculating the voice information, and outputting a second voice recognition result and a second recognition confidence coefficient T2 based on a local edge-end voice database;
(4) comparing the first voice recognition result with the second voice recognition result through a voice scene recognition model algorithm, and outputting a first cloud scene result S1 and a first local scene result S2;
(5) comparing the first recognition confidence T1 and the second recognition confidence T2 through a speech scene recognition model algorithm, and outputting a first recognition result confidence T1 and a second recognition result confidence T2;
(6) judging whether S1 is equal to S2; when S1 is equal to S2, judging whether T1 × T1 > T2 × T2 holds, outputting the first cloud scene result S1 when it holds, and otherwise executing cloud scene recognition and outputting a second cloud scene result; when S1 is not equal to S2, judging whether T1 × T1 < T2 × T2 holds, outputting the first local scene result S2 when it holds, and otherwise executing local scene recognition and outputting a second local scene result.
In the step (1), the voice information may be uttered by a person during voice control, and the collecting device may be a receiver (microphone); the collected voice is sent to the cloud through the network for recognition and sent locally as an electrical signal for recognition. In the step (2), the voice information is processed and calculated through the cloud by using the cloud voice database, in which the voice content and confidence for recognizing various kinds of voice information are stored. Optionally, the voice content and confidence stored in the cloud voice database are obtained through a cloud voice data model, which is obtained by machine training on a plurality of groups of voice data, each group including voice information together with the voice content and confidence for recognizing that voice information. Therefore, when different voice information is fed into the cloud voice data model, its content can be recognized, yielding a voice recognition result and the confidence of the recognition result in the scene. In the step (3), the voice information is processed and calculated locally through a local scene voice database, which is obtained through a local voice data model. The local scene voice database is analogous to the cloud scene voice database, and the local voice data model to the cloud voice data model, so a repeated description is not necessary. The confidence value reflects the accuracy of the current voice recognition and is related to factors such as the current network condition, the packet error rate and the recognition rate of the voice algorithm; the higher the value, the more accurate the recognition.
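The cloud and local recognition of steps (2) and (3) both amount to looking up voice information in a voice database and returning a recognition result together with a confidence. The following is a toy Python sketch of that interface, in which a set-overlap similarity stands in for the trained voice data model; all names and the similarity measure are illustrative assumptions, not the claimed implementation.

```python
from dataclasses import dataclass

@dataclass
class Recognition:
    text: str          # recognized voice content
    confidence: float  # accuracy estimate in [0, 1]

def recognize(audio_features, voice_database):
    """Return the closest matching entry of a (cloud or local) voice database.

    audio_features: hashable feature tokens extracted from the audio.
    voice_database: iterable of (entry_features, content) pairs.
    """
    best, best_score = None, 0.0
    for entry_features, content in voice_database:
        # Overlap ratio as a stand-in for the model's similarity score.
        overlap = len(set(audio_features) & set(entry_features))
        score = overlap / max(len(set(entry_features)), 1)
        if score > best_score:
            best, best_score = content, score
    return Recognition(text=best or "", confidence=best_score)
```

The same function can serve both sides of the method: the cloud side would call it with the (larger) cloud voice database, the local edge side with the local one, each producing its own result and confidence for the later comparison.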
In the step (4), the recognition results obtained locally and at the cloud are compared. The voice scene recognition model algorithm comprises a plurality of groups of recognition data, each group including voice information, cloud content used for marking the content of that voice information, and local content. The local content is obtained from different voice information in the local scene, and the cloud content from different voice information in the cloud scene, further improving the accuracy of voice information recognition. By comparing the cloud and local voice recognition results through the voice scene recognition model algorithm, the recognition result with the higher similarity to the model is obtained, and the result that better fits the semantic scene is screened out after comparison. A logic operation is then performed on the local and cloud recognition results and confidences, the preferred cloud or local scheme is selected, and the final arbitrated result is output.
In detail, the recognized voice information can be used for further training of the models and in the field of voice interaction, so that the voice information input by the user can be fed back accurately, improving the user's voice interaction experience.
In some embodiments of the present invention, the step (1) includes uploading the collected voice information to a cloud and a local area after noise elimination and human voice detection.
In detail, the collected voice is denoised or the human voice is reinforced through noise elimination and human voice detection, and the processed result is input to the local side and the cloud for recognition, which improves the accuracy of recognizing the semantics of the voice information.
In some embodiments of the present invention, in the step (1), it is determined whether a network exists, and when the network exists, the voice information is uploaded to the cloud and the local, respectively, and when the network does not exist, the voice information is uploaded to the local.
In the step (1), whether a network exists is judged, so that the voice information is uploaded to the cloud for recognition when the network exists and is kept locally when it does not; voice feedback can then be performed according to the locally recognized content, so the method is usable without a network and convenient for backing up voice information. The locally uploaded voice information can be stored in the memory of the local server, and the server can upload it to the cloud server through a gateway.
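The network check of step (1) can be sketched as follows. This is a minimal Python sketch in which `upload_cloud` and `store_local` are hypothetical callbacks standing in for the gateway upload and local storage described above.

```python
def route_voice(audio, network_available, upload_cloud, store_local):
    """Route collected voice per step (1): always keep a local copy for
    edge recognition; additionally upload to the cloud when a network
    exists. Returns the list of recognition targets used.
    """
    targets = []
    store_local(audio)          # local edge recognition always gets a copy
    targets.append("local")
    if network_available:
        upload_cloud(audio)     # cloud recognition runs in parallel
        targets.append("cloud")
    return targets
```

With no network only the local side receives the audio, matching the fallback behavior described in this embodiment.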
In some embodiments of the present invention, in the step (5), it is determined whether the second recognition confidence T2 is much greater than the first recognition confidence T1, and if so, local scene recognition is performed and a second local scene result is output; otherwise, step (6) is performed.
In detail, it is determined whether the second recognition confidence T2 is far greater than the first recognition confidence T1; if so, local scene recognition is performed and the result after the local scene recognition is output; otherwise, the next step is performed, that is, it is determined whether S1 is equal to S2. When the local confidence is far greater than that of the cloud, the local scene is preferentially selected for recognition and the recognition result is output directly, so that a more accurate recognition result can be obtained.
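The text does not quantify "far greater". One possible reading is a fixed ratio threshold between the two confidences, sketched below; the ratio value and the function name are assumptions for illustration only.

```python
def far_greater(t_a, t_b, ratio=2.0):
    """One possible reading of 'far greater' in step (5): confidence t_a
    exceeds confidence t_b by at least a fixed ratio. The ratio of 2.0
    is an illustrative assumption, not given in the text.
    """
    if t_b == 0.0:
        return True  # any positive confidence dominates a zero one
    return t_a / t_b >= ratio
```

In the embodiment above, the check would be applied as `far_greater(T2, T1)` to decide whether to short-circuit directly to local scene recognition.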
In some embodiments of the present invention, in the step (2), the processing and calculating the voice message through the cloud includes recognizing the voice message through a cloud voice database; in the step (3), the processing and the calculation of the voice information locally comprise recognizing the voice information through a local voice database.
In detail, the voice information is recognized through the cloud voice database at the cloud and through the local voice database locally, so that the cloud and the local side each use big data independently to recognize the voice; in the subsequent comparison, the rules learned from this big data for recognizing the same voice information are reused to obtain the final recognition result, which greatly improves its accuracy. The process of recognizing voice information through a voice database has been explained in detail above and is not repeated.
In some embodiments of the present invention, in the step (4), the first cloud scene result S1 is obtained by performing intent scene positioning on the first voice recognition result, and the first local scene result S2 is obtained by performing intent scene positioning on the second voice recognition result.
Processing a recognition result by intent scene positioning is known in the art and is not limited in detail here. Optionally, the intent scene recognition based on the voice information may convert the multi-modal input into text. Intent recognition is then performed on the converted text and each obtained intent is scored; this intent recognition may use conventional means. The similarity between the voice information and each candidate voice information is calculated, the calculated similarity is taken as another score, weighted addition and re-ranking are performed on it together with the score obtained in the conventional manner, and the candidate intent with the highest combined score is taken as the final output intent. The multi-modal input of the voice information may take the form of audio, image, text and the like. The scene result can be enhanced through intent scene positioning, further improving its accuracy.
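The weighted re-ranking of candidate intents described above can be sketched as follows; the weight value and all names are illustrative assumptions, not the claimed implementation.

```python
def rerank_intents(intent_scores, similarity_scores, weight=0.5):
    """Combine a conventional intent score with a similarity score by
    weighted addition and return the highest-scoring candidate intent.

    intent_scores: {intent: score} from conventional intent recognition.
    similarity_scores: {intent: score} from similarity to candidate voice
    information. The 0.5 weight is an illustrative assumption.
    """
    combined = {
        intent: weight * intent_scores.get(intent, 0.0)
        + (1 - weight) * similarity_scores.get(intent, 0.0)
        for intent in set(intent_scores) | set(similarity_scores)
    }
    # The candidate intent with the highest combined score wins.
    return max(combined, key=combined.get)
```

For example, a candidate that scores moderately on conventional intent recognition but very highly on similarity can overtake one that leads on the conventional score alone.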
In some embodiments of the present invention, in the step (4), when the first local scene result S2 is empty, cloud scene recognition is performed and the second cloud scene result is output; otherwise, it is determined whether T1 × T1 > T2 × T2 holds, the first cloud scene result S1 is output when it holds, and otherwise local scene recognition is performed and the second local scene result is output.
In detail, when the first local scene result does not exist, cloud scene recognition is executed and its result is output; otherwise, it is judged whether the accuracy of the cloud is higher, and if not, execution is switched to local scene recognition and its result is output. In this way the output is corrected automatically, and the accuracy of the voice recognition technology is improved.
In some embodiments of the present invention, the above speech recognition method based on cloud computing and edge computing further includes the step (7): and feeding back and outputting voice interaction information according to the output result.
In detail, feedback is performed according to a cloud or local output result, the output content can be voice communication or a result searched according to voice information, and the method can be applied to a voice recognition technology in the field of artificial intelligence.
According to the method, the interaction experience of a user's specific scene can be completed smoothly through switching between local voice recognition and cloud voice recognition, meeting the need to recognize richer voice information in an unstable network state; by comparing the local and cloud recognition results and selecting the best one for output, the accuracy of the recognition result is improved.
Example 2
An embodiment of the present application provides an electronic device, including: a memory for storing one or more programs; a processor; the one or more programs, when executed by the processor, implement the method as described in embodiment 1 above.
The memory, processor and communication interface are electrically connected to each other, directly or indirectly, to enable transfer or interaction of data. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory may be configured to store software programs and modules, such as program instructions/modules corresponding to implementation of the cloud computing and edge computing based speech recognition method provided in the embodiments of the present application, and the processor may execute various functional applications and data processing by executing the software programs and modules stored in the memory. The communication interface may be used for communicating signaling or data with other node devices.
The Memory may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The processor may be an integrated circuit chip having signal processing capabilities. The Processor may be a general-purpose Processor including a Central Processing Unit (CPU), a Network Processor (NP), etc.; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowcharts in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the flowchart illustrations, and combinations of blocks in the flowchart illustrations, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
To sum up, a voice recognition method, device, and medium based on cloud computing and edge computing provided by the embodiments of the present application:
according to the embodiment of the application, the voice information is collected and processed both at the cloud and locally, so that voice recognition remains usable when the network is unstable, improving the user's voice interaction experience; a first voice recognition result and a first recognition confidence are output based on the cloud voice database, and a second voice recognition result and a second recognition confidence are output based on the local voice database, the voice databases recognizing the content of the voice information while the confidence reflects the accuracy of the current recognition; the first and second voice recognition results are compared through the voice scene recognition model algorithm, the accuracy of the cloud scene result and of the local scene result is judged according to the logical relations after they are output, and the cloud or the local scene is selected accordingly to execute the corresponding result, which improves the accuracy of voice recognition; by comparing the recognition results and confidences obtained from the cloud and locally, the problem of incomplete information caused by switching between the cloud and the local can be alleviated in weak-network environments such as the subway.
According to the method, the voice recognition results of the cloud scene and the local scene are compared by the voice scene recognition model algorithm, so that user habits can be exploited to improve the accuracy of voice recognition; the multiple recognition results output by the cloud and the local for different voice information can also be used to upgrade the voice databases, so that voice is recognized more accurately according to the user's habits. The invention can adapt to a weak-network environment and improve the accuracy of voice recognition, thereby improving the user's experience of voice interaction.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.