CN111354349A - Voice recognition method and device and electronic equipment - Google Patents
- Publication number
- CN111354349A (application number CN201910305011.3A)
- Authority
- CN
- China
- Prior art keywords
- standard
- matching
- voice
- result
- engine library
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
- G10L2015/088—Word spotting
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a voice recognition method and apparatus and an electronic device. The voice recognition method includes: acquiring voice information; matching the voice information against a standard engine library; if the standard engine library matches and outputs a first matching result, taking the first matching result as the voice recognition result; if the standard engine library fails to output a matching result, matching the voice information against a non-standard engine library; and if the non-standard engine library matches and outputs a second matching result, taking the second matching result as the voice recognition result. The invention can improve the accuracy of voice recognition.
Description
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method and apparatus, and an electronic device.
Background
With the development of big data processing technology, specialized analysis and processing of mass data turns it into an information asset of great application value. At present, schools equipped with recording and broadcasting equipment generally record the video content of each class. This video content contains rich data such as the voice data, expression data and body-movement data of teachers and students, and big data analysis of it has potential information value for the education industry. With existing voice recognition technology, most standard voice content can be recognized from the video content, but non-standard voice content (such as regional dialects and popular expressions) cannot.
Disclosure of Invention
In view of the above, the present invention provides a speech recognition method and apparatus, and an electronic device, which can improve the speech recognition accuracy.
Based on the above object, the present invention provides a speech recognition method, comprising:
acquiring voice information;
matching the voice information against a standard engine library;
if the standard engine library matches and outputs a first matching result, taking the first matching result as the voice recognition result;
if the standard engine library fails to output a matching result, matching the voice information against a non-standard engine library;
and if the non-standard engine library matches and outputs a second matching result, taking the second matching result as the voice recognition result.
Optionally, the non-standard engine library includes a non-standard speech recognition module and a non-standard speech database; the non-standard speech recognition module recognizes the speech information and matches a speech recognition result from the non-standard speech database.
Optionally, when the non-standard speech database fails to match a speech recognition result, the speech information is split into a plurality of phrases, the phrases are input into the non-standard speech database separately for matching to obtain candidate phrases for each phrase, and the candidate phrases are combined and matched to obtain the phrase combination with the maximum matching probability, which serves as the second matching result.
Optionally, the method further includes: acquiring the voice information from video information, recognizing the face object producing the voice from the video information to obtain a face recognition result, and associating the face recognition result with the voice recognition result.
Optionally, the method further includes: extracting, from the voice recognition result, high-frequency words whose occurrence frequency is greater than a certain threshold as keywords.
An embodiment of the present invention further provides a speech recognition apparatus, including:
the acquisition module is used for acquiring voice information;
the standard voice matching module is used for matching the voice information against the standard engine library, and taking a first matching result as the voice recognition result if the standard engine library matches and outputs the first matching result;
and the non-standard voice matching module is used for matching the voice information against the non-standard engine library if the standard engine library fails to output a matching result, and taking a second matching result as the voice recognition result if the non-standard engine library matches and outputs the second matching result.
Optionally, the non-standard engine library includes a non-standard speech recognition module and a non-standard speech database, and the non-standard speech recognition module is used to recognize the speech information and match a speech recognition result with the non-standard speech database.
Optionally, the non-standard engine library further includes:
and the splitting and matching module is used for splitting the voice information into a plurality of phrases when the non-standard voice database fails to match a voice recognition result, inputting the phrases into the non-standard voice database separately for matching to obtain candidate phrases for each phrase, and combining and matching the candidate phrases to obtain the phrase combination with the maximum matching probability as the second matching result.
Optionally, the apparatus further comprises:
and the face recognition module is used for recognizing the face object producing the voice from the video information to obtain a face recognition result, and associating the face recognition result with the voice recognition result.
Optionally, the apparatus further comprises:
and the extraction module is used for extracting, from the voice recognition result, high-frequency words whose occurrence frequency is greater than a certain threshold as keywords.
The embodiment of the invention also provides an electronic device, which comprises a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor implements the above voice recognition method when executing the program.
In summary, the voice recognition method and apparatus and the electronic device provided by the invention acquire voice information from video information and first input it into the standard engine library for recognition; if it is not recognized there, the voice information is input into the non-standard engine library for recognition to obtain a voice recognition result. The non-standard engine library can split the voice information into phrases, match and recognize each phrase separately, and then combine and match the candidates, finally obtaining the phrase combination with the maximum matching probability as the voice recognition result. The invention can improve the breadth and accuracy of voice recognition, and big data analysis of video content recorded in school classrooms provides a data basis for the development and innovation of the education industry.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a speech recognition method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a speech recognition apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that the expressions "first" and "second" in the embodiments of the present invention are used only to distinguish two entities or parameters with the same name; "first" and "second" are merely for convenience of description and should not be construed as limiting the embodiments of the present invention. This is not repeated in the following embodiments.
The voice recognition method provided by the embodiment of the invention is used for recognizing the voice content from the video content, and comprises the following steps:
acquiring voice information;
matching the voice information against a standard engine library;
if the standard engine library matches and outputs a first matching result, taking the first matching result as the voice recognition result;
if the standard engine library fails to output a matching result, matching the voice information against a non-standard engine library;
and if the non-standard engine library matches and outputs a second matching result, taking the second matching result as the voice recognition result.
The voice recognition method of the embodiment of the invention establishes a standard engine library and a non-standard engine library and acquires voice information from video content. The voice information is first matched against the standard engine library; if the matching succeeds, the standard engine library outputs a matched first matching result, which is converted into text to realize voice recognition. If the standard engine library cannot recognize the voice information, the voice information is matched against the non-standard engine library; if the matching succeeds, the non-standard engine library outputs a matched second matching result, which is converted into text to realize voice recognition. The invention can recognize both standard and non-standard voice information and improve the accuracy of voice recognition.
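The two-stage cascade described above can be sketched as follows; the `EngineLibrary` class, its `match` method, and the sample entries are hypothetical stand-ins, since the patent does not specify an implementation or API:

```python
from typing import Optional

class EngineLibrary:
    """Hypothetical engine library: maps recognized utterances to text."""
    def __init__(self, entries: dict):
        self.entries = entries

    def match(self, utterance: str) -> Optional[str]:
        # Return the matched text, or None when nothing in the library matches.
        return self.entries.get(utterance)

def recognize(utterance: str, standard: EngineLibrary,
              nonstandard: EngineLibrary) -> Optional[str]:
    # Step 1: match against the standard engine library first.
    first = standard.match(utterance)
    if first is not None:
        return first  # first matching result -> voice recognition result
    # Step 2: fall back to the non-standard engine library.
    return nonstandard.match(utterance)  # None means "not recognized"

standard_lib = EngineLibrary({"good morning class": "good morning class"})
nonstandard_lib = EngineLibrary({"shan beng": "three-bouncer"})  # dialect entry
print(recognize("shan beng", standard_lib, nonstandard_lib))  # three-bouncer
```

A `None` return corresponds to the "prompt that the voice information is not recognized" branch described later in the method.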
FIG. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention. As shown in the figure, the speech recognition method provided in the embodiment of the present invention includes:
s10: acquiring voice information;
in the embodiment of the invention, the voice information is obtained from the video content of each class recorded by the recording and playing equipment, and the voice information in the video content is identified.
S11: matching a standard engine library according to the voice information;
the standard engine library comprises a standard voice recognition module, a standard voice database and the like. In the embodiment of the invention, the acquired voice information is input into a standard engine library, the voice information is recognized by using a standard voice recognition module, and a voice recognition result is matched from a standard voice database.
S12: if the standard engine library matches and outputs a first matching result, taking the first matching result as the voice recognition result;
If the input voice information is standard voice information, the standard voice recognition module recognizes it, a first matching result is matched and output from the standard voice database, and the first matching result is then converted into text as the voice recognition result.
The standard voice information is, for example, Mandarin and standard words, and such standard voice information is stored in the standard voice database.
S13: if the standard engine library fails to output a matching result, matching the voice information against the non-standard engine library;
The non-standard engine library comprises a non-standard voice recognition module, a non-standard voice database and the like. In the embodiment of the invention, if the standard engine library fails to output a recognition result, the acquired voice information is input into the non-standard engine library, the non-standard voice recognition module recognizes the voice information, and a voice recognition result is matched from the non-standard voice database.
S14: if the non-standard engine library matches and outputs a second matching result, taking the second matching result as the voice recognition result.
If the input voice information is non-standard voice information, the non-standard voice recognition module recognizes it and matches a voice recognition result from the non-standard voice database. If the matching succeeds, a second matching result is output and converted into text as the voice recognition result; if the matching fails, a prompt is output indicating that the voice information is not recognized.
In an embodiment of the present invention, the non-standard engine library further includes a splitting and matching module, which, when the non-standard voice database fails to match a voice recognition result, splits the voice information into a plurality of phrases, inputs the phrases into the non-standard voice database separately for matching to obtain candidate phrases for each phrase, and combines and matches the candidate phrases to obtain the phrase combination with the maximum matching probability as the second matching result. Specifically:
For non-standard voice information, if the non-standard engine library fails to match it as a whole, it is judged whether the non-standard voice information contains a plurality of phrases; if so, it is split into a plurality of phrases, the phrases are input into the non-standard engine library separately for matching, each phrase yielding a plurality of matched candidate phrases, and the candidate phrases are then combined and matched to obtain the phrase combination with the maximum matching probability as the second matching result. For example, in a certain region a tricycle is colloquially called a "three-bouncer", and the acquired non-standard voice information is heard as "mountain bounce". Matching "mountain bounce" against the non-standard engine library as a whole fails, so the library splits it into the two phrases "mountain" and "bounce" and recognizes each separately. For the phrase "mountain", the candidate results are "three", "mountain", "umbrella", "fir" and the like; for the phrase "bounce", the candidate results are "bouncer", "little bounce", "bounce" and the like. Combining and matching the candidates yields "three-bouncer", "mountain bounce", "umbrella bounce" and the like, and the phrase combination with the maximum matching probability, "three-bouncer", is selected as the second matching result.
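The split-and-combine fallback can be sketched as an exhaustive search over candidate phrases. The candidate lists and probabilities below are invented for illustration, mirroring the "mountain bounce" example; a real engine would score candidates with its acoustic and language models rather than treating phrases as independent:

```python
from itertools import product

def best_combination(candidates_per_phrase):
    """candidates_per_phrase: one {candidate: probability} dict per split
    phrase. Returns (combination, joint probability) for the combination
    with the maximum matching probability (phrase independence assumed)."""
    best, best_p = None, -1.0
    for combo in product(*(c.items() for c in candidates_per_phrase)):
        p = 1.0
        for _, prob in combo:
            p *= prob
        if p > best_p:
            best, best_p = tuple(word for word, _ in combo), p
    return best, best_p

# Invented candidates: "mountain" -> three/mountain/umbrella/fir,
# "bounce" -> bouncer/bounce.
cands = [
    {"three": 0.5, "mountain": 0.2, "umbrella": 0.2, "fir": 0.1},
    {"bouncer": 0.6, "bounce": 0.4},
]
combo, p = best_combination(cands)
print(combo, round(p, 2))  # ('three', 'bouncer') 0.3
```

The search space grows as the product of the candidate-list sizes, so a practical system would prune with a beam rather than enumerate every combination.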
The non-standard voice information includes, for example, regional dialects, popular words, customary words and words with special tones; such non-standard voice information is stored in the non-standard voice database.
In the embodiment of the invention, the non-standard voice database of the non-standard engine library can be updated by a machine learning method so as to improve the breadth and accuracy of voice recognition.
In the embodiment of the present invention, the voice recognition method further includes: acquiring the voice content from the video content, recognizing the face object producing the voice while performing voice recognition to obtain a face recognition result, and associating the face recognition result with the voice recognition result. The face recognition result is basic information such as the name of the face object, and the voice recognition result is the text converted from the first matching result or the second matching result; associating the basic information of the face object with the recognized text forms the recognition result for that face object.
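The association step can be sketched as simple record linking; the `SpeakerUtterance` type, its field names, and the sample speaker are illustrative assumptions, as the patent does not specify a data model:

```python
from dataclasses import dataclass

@dataclass
class SpeakerUtterance:
    speaker_name: str  # basic information from the face recognition result
    text: str          # text converted from the first/second matching result

def associate(face_result: str, voice_result: str) -> SpeakerUtterance:
    # Link the face object's basic information to the recognized text.
    return SpeakerUtterance(speaker_name=face_result, text=voice_result)

# "Teacher Zhang" and the utterance are made-up sample data.
record = associate("Teacher Zhang", "please open your textbooks")
print(record.speaker_name, "-", record.text)
```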
In the embodiment of the present invention, the voice recognition method further includes: extracting, from the voice recognition result, high-frequency words whose occurrence frequency is greater than a certain threshold as keywords. Voice recognition is performed on the video content of a certain class, or on the video content of a certain subject within a certain period, and the text converted from the first matching result or the second matching result is analyzed to obtain the high-frequency words whose occurrence frequency is greater than a certain threshold as keywords.
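A minimal sketch of the frequency-threshold extraction, assuming whitespace-delimited text; the tokenizer and the threshold value are placeholders, since the patent only requires an occurrence frequency greater than a certain threshold:

```python
from collections import Counter

def extract_keywords(transcript: str, threshold: int = 3) -> list:
    # Naive whitespace tokenization for illustration; a real system would
    # need a proper (e.g. Chinese) word segmenter.
    counts = Counter(transcript.lower().split())
    return [word for word, n in counts.items() if n > threshold]

transcript = "equation " * 5 + "chalk " * 2  # toy classroom transcript
print(extract_keywords(transcript, threshold=3))  # ['equation']
```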
Fig. 2 is a schematic diagram of a speech recognition apparatus according to an embodiment of the present invention. As shown in the drawings, a speech recognition apparatus provided in an embodiment of the present invention is configured to recognize speech content from video content, and the apparatus includes:
the acquisition module is used for acquiring voice information;
the standard voice matching module is used for matching the voice information against the standard engine library, and taking a first matching result as the voice recognition result if the standard engine library matches and outputs the first matching result;
and the non-standard voice matching module is used for matching the voice information against the non-standard engine library if the standard engine library fails to output a matching result, and taking a second matching result as the voice recognition result if the non-standard engine library matches and outputs the second matching result.
The voice recognition apparatus of the embodiment of the invention establishes a standard engine library and a non-standard engine library. The acquisition module acquires voice information from video content, and the standard voice matching module first matches the voice information against the standard engine library; if the matching succeeds, the standard engine library outputs a matched first matching result, which is converted into text to realize voice recognition. If the standard engine library cannot recognize the voice information, the non-standard voice matching module matches the voice information against the non-standard engine library; if the matching succeeds, the non-standard engine library outputs a matched second matching result, which is converted into text to realize voice recognition. The invention can recognize both standard and non-standard voice information and improve the accuracy of voice recognition.
In the embodiment of the invention, the standard engine library comprises a standard voice recognition module, a standard voice database and the like. The acquired voice information is input into the standard engine library, the standard voice recognition module recognizes it, and a voice recognition result is matched from the standard voice database. If the input voice information is standard voice information, a first matching result is matched and output from the standard voice database and then converted into text as the voice recognition result.
The standard voice information is, for example, Mandarin and standard words, and such standard voice information is stored in the standard voice database.
In the embodiment of the invention, the non-standard engine library comprises a non-standard voice recognition module, a non-standard voice database and the like. If the standard engine library fails to output a recognition result, the acquired voice information is input into the non-standard engine library, the non-standard voice recognition module recognizes it, and a voice recognition result is matched from the non-standard voice database. If the input voice information is non-standard voice information, the non-standard voice recognition module recognizes it and matches a voice recognition result from the non-standard voice database; if the matching succeeds, a second matching result is output and converted into text as the voice recognition result, and if the matching fails, a prompt is output indicating that the voice information is not recognized.
In an embodiment of the present invention, the non-standard engine library further includes:
and the splitting and matching module is used for splitting the voice information into a plurality of phrases when the non-standard voice database fails to match a voice recognition result, inputting the phrases into the non-standard voice database separately for matching to obtain candidate phrases for each phrase, and combining and matching the candidate phrases to obtain the phrase combination with the maximum matching probability as the second matching result.
For non-standard voice information, if the non-standard engine library fails to match it as a whole, it is judged whether the non-standard voice information contains a plurality of phrases; if so, it is split into a plurality of phrases, the phrases are input into the non-standard engine library separately for matching, each phrase yielding a plurality of matched candidate phrases, and the candidate phrases are then combined and matched to obtain the phrase combination with the maximum matching probability as the second matching result. For example, in a certain region a tricycle is colloquially called a "three-bouncer", and the acquired non-standard voice information is heard as "mountain bounce". Matching "mountain bounce" against the non-standard engine library as a whole fails, so the library splits it into the two phrases "mountain" and "bounce" and recognizes each separately. For the phrase "mountain", the candidate results are "three", "mountain", "umbrella", "fir" and the like; for the phrase "bounce", the candidate results are "bouncer", "little bounce", "bounce" and the like. Combining and matching the candidates yields "three-bouncer", "mountain bounce", "umbrella bounce" and the like, and the phrase combination with the maximum matching probability, "three-bouncer", is selected as the second matching result.
The non-standard voice information includes, for example, regional dialects, popular words, customary words and words with special tones; such non-standard voice information is stored in the non-standard voice database.
In the embodiment of the invention, the non-standard voice database of the non-standard engine library can be updated by a machine learning method so as to improve the breadth and accuracy of voice recognition.
In an embodiment of the present invention, the speech recognition apparatus further includes:
and the face recognition module is used for recognizing the face object producing the voice from the video content to obtain a face recognition result, and associating the face recognition result with the voice recognition result. The face recognition result is basic information such as the name of the face object, and the voice recognition result is the text converted from the first matching result or the second matching result; associating the basic information of the face object with the recognized text forms the recognition result for that face object.
In an embodiment of the present invention, the speech recognition apparatus further includes:
and the extraction module is used for extracting, from the voice recognition result, high-frequency words whose occurrence frequency is greater than a certain threshold as keywords. Voice recognition is performed on the video content of a certain class, or on the video content of a certain subject within a certain period, and the text converted from the first matching result or the second matching result is analyzed to obtain the high-frequency words whose occurrence frequency is greater than a certain threshold as keywords.
In view of the above object, an embodiment of the present invention further provides an apparatus for performing the speech recognition method. The device comprises:
one or more processors, and a memory.
The apparatus for performing the voice recognition method may further include: an input device and an output device.
The processor, memory, input device, and output device may be connected by a bus or other means.
The memory, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the speech recognition method in embodiments of the present invention. The processor executes various functional applications of the server and data processing by running nonvolatile software programs, instructions and modules stored in the memory, namely, implements the voice recognition method of the above-described method embodiment.
The memory may include a program storage area and a data storage area; the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the apparatus performing the voice recognition method, and the like. Further, the memory may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory optionally includes memory remotely located from the processor, and these remote memories may be connected to the apparatus performing the voice recognition method via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device may receive input numeric or character information and generate key signal inputs related to user settings and function control of the device performing the voice recognition method. The output device may include a display device such as a display screen.
The one or more modules are stored in the memory and, when executed by the one or more processors, perform the speech recognition method of any of the method embodiments described above. The technical effect of the embodiment of the device for executing the voice recognition method is the same as or similar to that of any method embodiment.
The embodiment of the invention also provides a non-transitory computer storage medium, wherein the computer storage medium stores computer-executable instructions, and the computer-executable instructions can execute the voice recognition method in any of the above method embodiments. Embodiments of the non-transitory computer storage medium may be the same as or similar in technical effect to any of the method embodiments described above.
Finally, it should be noted that, as will be understood by those skilled in the art, all or part of the processes in the methods of the above embodiments may be implemented by a computer program that can be stored in a computer-readable storage medium and that, when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like. The technical effect of the embodiment of the computer program is the same as or similar to that of any of the method embodiments described above.
Furthermore, the apparatuses, devices, etc. described in the present disclosure may be various electronic terminal devices, such as a mobile phone, a Personal Digital Assistant (PDA), a tablet computer (PAD), a smart television, etc., and may also be large terminal devices, such as a server, etc., and therefore the scope of protection of the present disclosure should not be limited to a specific type of apparatus, device. The client disclosed by the present disclosure may be applied to any one of the above electronic terminal devices in the form of electronic hardware, computer software, or a combination of both.
Furthermore, the method according to the present disclosure may also be implemented as a computer program executed by a CPU, which may be stored in a computer-readable storage medium. The computer program, when executed by the CPU, performs the above-described functions defined in the method of the present disclosure.
Further, the above method steps and system elements may also be implemented using a controller and a computer readable storage medium for storing a computer program for causing the controller to implement the functions of the above steps or elements.
Further, it should be appreciated that the computer-readable storage media (e.g., memory) described herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. By way of example, and not limitation, nonvolatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which can act as external cache memory. By way of example and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
The apparatus of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the invention, also features in the above embodiments or in different embodiments may be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity.
In addition, well known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures for simplicity of illustration and discussion, and so as not to obscure the invention. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the present invention is to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, the discussed embodiments may be applied to other memory architectures (e.g., dynamic RAM (DRAM)).
The embodiments of the invention are intended to embrace all such alternatives, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (11)
1. A speech recognition method, comprising:
acquiring voice information;
matching a standard engine library according to the voice information;
if matching against the standard engine library outputs a first matching result, taking the first matching result as the voice recognition result;
if matching against the standard engine library outputs no result, matching a non-standard engine library according to the voice information;
and if matching against the non-standard engine library outputs a second matching result, taking the second matching result as the voice recognition result.
2. The method of claim 1, wherein the non-standard engine library comprises a non-standard speech recognition module, a non-standard speech database, and wherein the speech information is recognized by the non-standard speech recognition module and the speech recognition result is matched from the non-standard speech database.
3. The method according to claim 2, wherein when no speech recognition result is matched from the non-standard speech database, the speech information is split into a plurality of phrases, the phrases are respectively input into the non-standard speech database for matching to obtain possible phrases of each phrase, and the possible phrases of each phrase are combined and matched to obtain the phrase combination with the maximum matching probability as the second matching result.
4. The method of claim 1, further comprising: and acquiring the voice information from the video information, identifying a face object emitting voice from the video information to obtain a face identification result, and associating the face identification result with the voice identification result.
5. The method of claim 1, further comprising: extracting, from the voice recognition result, high-frequency words whose occurrence frequency is greater than a certain threshold as keywords.
6. A speech recognition apparatus, comprising:
the acquisition module is used for acquiring voice information;
the standard voice matching module is used for matching a standard engine library according to the voice information, and for taking a first matching result as the voice recognition result if matching against the standard engine library outputs the first matching result;
and the non-standard voice matching module is used for matching a non-standard engine library according to the voice information if the standard engine library outputs no matching result, and for taking a second matching result as the voice recognition result if matching against the non-standard engine library outputs the second matching result.
7. The apparatus of claim 6, wherein the non-standard engine library comprises a non-standard speech recognition module, a non-standard speech database, and wherein the non-standard speech recognition module is used to recognize the speech information and match speech recognition results from the non-standard speech database.
8. The apparatus of claim 7, wherein the non-standard engine library further comprises:
and the splitting and matching module is used for splitting the voice information into a plurality of phrases when no voice recognition result is matched from the non-standard voice database, respectively inputting the phrases into the non-standard voice database for matching to obtain possible phrases of each phrase, and combining and matching the possible phrases of each phrase to obtain the phrase combination with the maximum matching probability as the second matching result.
9. The apparatus of claim 6, further comprising:
and the face recognition module is used for recognizing a face object emitting voice from the video information to obtain a face recognition result, and associating the face recognition result with the voice recognition result.
10. The apparatus of claim 6, further comprising:
and the extraction module is used for extracting, from the voice recognition result, high-frequency words whose occurrence frequency is greater than a certain threshold as keywords.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 5 when executing the program.
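For illustration only, the cascade of claims 1 and 3 and the keyword extraction of claims 5 and 10 can be sketched as follows. This is a minimal sketch under stated assumptions: the two engine classes, their `match`/`split`/`candidates` interfaces, the whitespace-based phrase splitter, and the joint-probability callable are all hypothetical stand-ins, not part of the claimed implementation.

```python
from collections import Counter
from itertools import product

class StandardEngine:
    """Hypothetical stand-in for the standard (e.g. Mandarin) engine
    library: a plain lookup table; returns None when no result is output."""
    def __init__(self, table):
        self.table = table

    def match(self, voice):
        return self.table.get(voice)

class NonStandardEngine:
    """Hypothetical stand-in for the non-standard (e.g. dialect) engine
    library and its non-standard speech database (claims 2-3)."""
    def __init__(self, table, phrase_db, joint_prob):
        self.table = table            # whole-utterance matches
        self.phrase_db = phrase_db    # phrase -> possible phrases
        self.joint_prob = joint_prob  # phrase combination -> probability

    def match(self, voice):
        return self.table.get(voice)

    def split(self, voice):
        return voice.split()          # toy splitter: whitespace phrases

    def candidates(self, phrase):
        return self.phrase_db.get(phrase, [phrase])

def recognize(voice, std, nonstd):
    """Cascade of claims 1 and 3: standard engine first, then the
    non-standard engine, then phrase splitting with the
    maximum-probability combination as the second matching result."""
    result = std.match(voice)         # first matching result
    if result is not None:
        return result
    result = nonstd.match(voice)      # second matching result
    if result is not None:
        return result
    # Claim 3: split into phrases, enumerate possible-phrase
    # combinations, keep the one with maximum matching probability.
    cand_lists = [nonstd.candidates(p) for p in nonstd.split(voice)]
    best = max(product(*cand_lists), key=nonstd.joint_prob)
    return " ".join(best)

def extract_keywords(results, threshold):
    """Claims 5/10: high-frequency words whose count exceeds the threshold."""
    counts = Counter(w for text in results for w in text.split())
    return {w for w, n in counts.items() if n > threshold}
```

As a usage sketch, `recognize("hello", std, nonstd)` would return the standard engine's match directly, while an utterance unknown to both lookup tables would fall through to the phrase-splitting branch.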
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910305011.3A CN111354349A (en) | 2019-04-16 | 2019-04-16 | Voice recognition method and device and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111354349A true CN111354349A (en) | 2020-06-30 |
Family
ID=71196967
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910305011.3A Pending CN111354349A (en) | 2019-04-16 | 2019-04-16 | Voice recognition method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111354349A (en) |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104391673A (en) * | 2014-11-20 | 2015-03-04 | 百度在线网络技术(北京)有限公司 | Voice interaction method and voice interaction device |
CN105096940A (en) * | 2015-06-30 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Method and device for voice recognition |
US20160118050A1 (en) * | 2014-10-24 | 2016-04-28 | Sestek Ses Ve Iletisim Bilgisayar Teknolojileri Sanayi Ticaret Anonim Sirketi | Non-standard speech detection system and method |
CN105872687A (en) * | 2016-03-31 | 2016-08-17 | 乐视控股(北京)有限公司 | Method and device for controlling intelligent equipment through voice |
US20160253990A1 (en) * | 2015-02-26 | 2016-09-01 | Fluential, Llc | Kernel-based verbal phrase splitting devices and methods |
CN105931643A (en) * | 2016-06-30 | 2016-09-07 | 北京海尔广科数字技术有限公司 | Speech recognition method and apparatus |
CN106251859A (en) * | 2016-07-22 | 2016-12-21 | 百度在线网络技术(北京)有限公司 | Voice recognition processing method and apparatus |
CN106385548A (en) * | 2016-09-05 | 2017-02-08 | 努比亚技术有限公司 | Mobile terminal and method for generating video captions |
CN106910498A (en) * | 2017-03-01 | 2017-06-30 | 成都启英泰伦科技有限公司 | The method for improving voice control command word discrimination |
CN107220292A (en) * | 2017-04-25 | 2017-09-29 | 上海庆科信息技术有限公司 | Intelligent dialogue device, reaction type intelligent sound control system and method |
GB2549117A (en) * | 2016-04-05 | 2017-10-11 | Chase Information Tech Services Ltd | A searchable media player |
CN109036410A (en) * | 2018-08-30 | 2018-12-18 | Oppo广东移动通信有限公司 | Audio recognition method, device, storage medium and terminal |
CN109065020A (en) * | 2018-07-28 | 2018-12-21 | 重庆柚瓣家科技有限公司 | The identification storehouse matching method and system of multilingual classification |
CN109360564A (en) * | 2018-12-10 | 2019-02-19 | 珠海格力电器股份有限公司 | Method and device for selecting language identification mode and household appliance |
CN109450745A (en) * | 2018-10-15 | 2019-03-08 | 深圳市欧瑞博科技有限公司 | Information processing method, device, intelligence control system and intelligent gateway |
CN109524017A (en) * | 2018-11-27 | 2019-03-26 | 北京分音塔科技有限公司 | A kind of the speech recognition Enhancement Method and device of user's custom words |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112017653A (en) * | 2020-07-13 | 2020-12-01 | 武汉戴美激光科技有限公司 | Laser treatment handle with voice recognition function and adjusting method |
CN112102833A (en) * | 2020-09-22 | 2020-12-18 | 北京百度网讯科技有限公司 | Voice recognition method, device, equipment and storage medium |
CN112102833B (en) * | 2020-09-22 | 2023-12-12 | 阿波罗智联(北京)科技有限公司 | Speech recognition method, device, equipment and storage medium |
CN114495931A (en) * | 2022-01-28 | 2022-05-13 | 达闼机器人股份有限公司 | Voice interaction method, system, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109346059B (en) | Dialect voice recognition method and electronic equipment | |
CN110444198B (en) | Retrieval method, retrieval device, computer equipment and storage medium | |
US11482242B2 (en) | Audio recognition method, device and server | |
CN109461437B (en) | Verification content generation method and related device for lip language identification | |
WO2018223796A1 (en) | Speech recognition method, storage medium, and speech recognition device | |
CN112530408A (en) | Method, apparatus, electronic device, and medium for recognizing speech | |
JP2020030408A (en) | Method, apparatus, device and medium for identifying key phrase in audio | |
CN111967224A (en) | Method and device for processing dialog text, electronic equipment and storage medium | |
CN114556328B (en) | Data processing method, device, electronic equipment and storage medium | |
CN110910903B (en) | Speech emotion recognition method, device, equipment and computer readable storage medium | |
US20150179173A1 (en) | Communication support apparatus, communication support method, and computer program product | |
CN105975569A (en) | Voice processing method and terminal | |
CN107844470B (en) | Voice data processing method and equipment thereof | |
CN111259148A (en) | Information processing method, device and storage medium | |
CN111354349A (en) | Voice recognition method and device and electronic equipment | |
CN110826637A (en) | Emotion recognition method, system and computer-readable storage medium | |
US11893813B2 (en) | Electronic device and control method therefor | |
CN110544470B (en) | Voice recognition method and device, readable storage medium and electronic equipment | |
CN114449310A (en) | Video editing method and device, computer equipment and storage medium | |
KR102312993B1 (en) | Method and apparatus for implementing interactive message using artificial neural network | |
CN114429635A (en) | Book management method | |
CN111354377B (en) | Method and device for recognizing emotion through voice and electronic equipment | |
CN115840808A (en) | Scientific and technological project consultation method, device, server and computer-readable storage medium | |
CN109635125B (en) | Vocabulary atlas building method and electronic equipment | |
CN110970030A (en) | Voice recognition conversion method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||