CN112802457A - Method, device, equipment and storage medium for voice recognition - Google Patents

Method, device, equipment and storage medium for voice recognition

Info

Publication number
CN112802457A
CN112802457A (application CN202110397846.3A)
Authority
CN
China
Prior art keywords
audio
speech recognition
request
micro
microservice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110397846.3A
Other languages
Chinese (zh)
Inventor
张骞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd filed Critical Beijing Century TAL Education Technology Co Ltd
Priority to CN202110397846.3A priority Critical patent/CN112802457A/en
Publication of CN112802457A publication Critical patent/CN112802457A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/04 — Segmentation; Word boundary detection
    • G10L 15/26 — Speech to text systems
    • G10L 15/28 — Constructional details of speech recognition systems
    • G10L 15/34 — Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing

Abstract

The application provides a method, an apparatus, a device, and a storage medium for speech recognition, relating to the field of speech recognition. The method comprises the following steps: a segmentation microservice segments audio data to obtain a set comprising a plurality of audio segments; the segmentation microservice initiates a request to a speech recognition microservice in a concurrent manner to process the set; and the speech recognition microservice performs speech recognition on the plurality of audio segments in the set in parallel according to the request. The embodiments of the application improve system utilization and effectively increase the speed of audio recognition.

Description

Method, device, equipment and storage medium for voice recognition
Technical Field
The present application relates to the field of speech recognition, and in particular, to a method, an apparatus, a device, and a storage medium for speech recognition.
Background
In speech recognition, large audio files are often slow to process. Among existing voice-file recognition schemes, one supports audio of at most 100 MB and is slow: a voice file 30 minutes long needs about 10 seconds of processing time, and when many voice files must be processed, raising the concurrency quota and purchasing additional processing time become considerable expenses. Another scheme supports audio of at most 512 MB, but audio about 5 hours long requires the user to wait about 5 hours, nearly 1:1 processing efficiency, which is very slow. How to increase the processing speed for large audio files, for example recognizing an hour or more of audio within seconds, is therefore a problem to be solved.
Disclosure of Invention
The embodiments of the present application provide a method, an apparatus, a device, and a storage medium for speech recognition, intended to solve the above problems in the related art. The technical solution is as follows:
In a first aspect, an embodiment of the present application provides a method of speech recognition, including:
the segmentation microservice segments audio data to obtain a set comprising a plurality of audio segments;
the segmentation microservice initiates a request to a speech recognition microservice in a concurrent manner to process the set;
and the speech recognition microservice performs speech recognition on the plurality of audio segments in the set in parallel according to the request.
In a second aspect, an embodiment of the present application provides an apparatus for speech recognition, including:
a segmentation microservice module, configured to segment audio data to obtain a set comprising a plurality of audio segments, and to initiate a request to process the set in a concurrent manner;
and a speech recognition microservice module, configured to perform speech recognition on the plurality of audio segments in the set in parallel according to the request of the segmentation microservice module.
In a third aspect, an embodiment of the present application provides a device for speech recognition, the device including a memory and a processor that communicate with each other via an internal connection path. The memory is configured to store instructions, and the processor is configured to execute the stored instructions so as to perform the method of any one of the above aspects.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program which, when run on a computer, performs the method of any one of the above aspects.
The advantages or beneficial effects of the above technical solution include at least the following: the audio data is processed concurrently after being segmented, and the time spent on segmentation is negligible. Compared with processing the audio data directly, this approach makes maximum use of computing resources, greatly increases the processing speed, scales well, and can quickly and effectively handle larger audio files.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present application will be readily apparent by reference to the drawings and following detailed description.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
FIG. 1 is a flow diagram of a method of speech recognition according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating audio data slicing principles according to an embodiment of the present application;
FIG. 3 is a flow diagram of a method of speech recognition according to another embodiment of the present application;
FIG. 4 is a block diagram illustrating a method of speech recognition according to an embodiment of the present application;
FIG. 5 is a block diagram of an apparatus for speech recognition according to an embodiment of the present application;
FIG. 6 is a block diagram of an apparatus for speech recognition according to another embodiment of the present application;
fig. 7 is a block diagram of a speech recognition device according to an embodiment of the present application.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
FIG. 1 shows a flow diagram of a method of speech recognition according to an embodiment of the present application. As shown in fig. 1, the method of speech recognition may include:
S11, the segmentation microservice segments the audio data to obtain a set comprising a plurality of audio segments;
S12, the segmentation microservice sends a request to the speech recognition microservice in a concurrent manner to process the set;
and S13, the speech recognition microservice performs speech recognition on the plurality of audio segments in the set in parallel according to the request.
The method processes the audio data concurrently after segmenting it, and the time spent on segmentation is negligible. Compared with processing the audio data directly, the method makes maximum use of computing resources and greatly increases the processing speed: audio that takes more than an hour to process directly can be completed within about ten seconds, more than twice the processing speed of the prior art. The method also scales well, can quickly and effectively process larger audio files, and raises the maximum size of a single audio file that can be processed.
In one embodiment, the S11 may include:
the segmentation microservice segments the audio data by using Voice Activity Detection (VAD) and Voice pause as a segmentation mark to obtain a set comprising a plurality of audio segments, wherein each audio segment corresponds to a sentence.
Generally, human speech is in sentence units, and in a dialogue scene, there may be several sentences, and there may be pauses of a certain time length between each sentence. Considering that audio data in a dialog scene may be in hours, it is very time consuming to directly process the audio data in hours. Therefore, in the present application, the VAD can be used to segment the audio data into audio segments corresponding to sentences according to the speech pause. VAD is also called voice endpoint detection or voice boundary detection, and there are various implementation manners, for example, sensitivity and false detection of missing detection may be performed before segmentation, the mute duration may be removed in the segmentation process, and noise reduction processing of a gaussian model may be performed on audio data, so as to remove interference of signals such as background sound and current noise during conversation. The audio before pause is identified as a sentence by taking the pause as a segmentation mark, and then a plurality of audio segments can be obtained, wherein each audio segment corresponds to a sentence. Thereby providing data support for subsequent parallel processing.
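As an illustration of this step, the sketch below splits a stream of PCM samples at pauses using a simple energy threshold. It is a hypothetical stand-in for a real VAD (which would model noise statistically, e.g. with a Gaussian model as the text mentions), and all parameter values are illustrative assumptions rather than the patent's settings.

```python
def split_on_pauses(samples, rate, frame_ms=30, energy_threshold=500,
                    min_pause_frames=5):
    """Split a list of PCM samples into segments at speech pauses.

    A frame whose mean absolute amplitude falls below `energy_threshold`
    counts as silence; a run of `min_pause_frames` silent frames closes
    the current segment (one "sentence"). Silent frames are dropped,
    matching the text's note that mute intervals are removed.
    """
    frame_len = int(rate * frame_ms / 1000)
    segments, current, silent_run = [], [], 0
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(abs(s) for s in frame) / max(len(frame), 1)
        if energy < energy_threshold:
            silent_run += 1
            if silent_run >= min_pause_frames and current:
                segments.append(current)   # pause long enough: end the sentence
                current = []
        else:
            silent_run = 0
            current.extend(frame)
    if current:
        segments.append(current)
    return segments

# Loud run, long silence, loud run -> two "sentence" segments.
segs = split_on_pauses([1000] * 3000 + [0] * 3000 + [1000] * 3000, rate=1000)
```

A production system would more likely use a trained VAD (e.g. a WebRTC-style model) than a raw energy gate, but the segmentation contract is the same: a list of per-sentence segments ready for parallel recognition.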
Fig. 2 is a schematic diagram illustrating the audio data segmentation principle according to an embodiment of the present application. Referring to fig. 2, the audio data to be processed carries the following speech: "Hello, my name is Xiaoming, I am from Beijing's western district, I like programming, I love Beijing". After VAD segmentation at the speech pauses, 5 sentences are obtained: "Hello", "my name is Xiaoming", "I am from Beijing's western district", "I like programming", and "I love Beijing", yielding 5 audio segments on which speech recognition can then be performed in parallel.
In one embodiment, the S13 may include:
According to the request, the speech recognition microservice uses IO multiplexing to perform speech recognition on the plurality of audio segments in the set in parallel. This makes full use of the computer's IO resources.
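A minimal sketch of what IO-multiplexed parallel recognition can look like, assuming the recognition calls are IO-bound network requests: one event loop thread keeps all requests in flight at once, in the spirit of select/epoll-style IO multiplexing. `recognize_segment` is a hypothetical placeholder, not the patent's API.

```python
import asyncio

async def recognize_segment(segment_id, audio_bytes):
    """Stand-in for one ASR call; a real service would await a network
    round-trip here, during which the loop serves the other segments."""
    await asyncio.sleep(0.01)          # simulates the IO wait
    return (segment_id, f"transcript-{segment_id}")

async def recognize_all(segments):
    # gather() multiplexes every in-flight request on one event loop
    # and returns results in submission (i.e. segment) order.
    tasks = [recognize_segment(i, seg) for i, seg in enumerate(segments)]
    return await asyncio.gather(*tasks)

results = asyncio.run(recognize_all([b"a", b"b", b"c"]))
```

With IO-bound calls, total latency approaches that of the slowest single segment rather than the sum of all segments, which is where the speed-up over sequential processing comes from.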
In another embodiment, the S13 may include:
According to the request, the speech recognition microservice performs speech recognition on the plurality of audio segments in the set in parallel, with a concurrency number preset according to the available computing resources. This makes maximum use of system resources, improves utilization, and is flexible and convenient to apply.
The concurrency number may be set according to the computing resources of the actual scenario and is not specifically limited. For example, with two 32-core CPU machines and a maximum concurrency of 64, 64 audio segments can be processed concurrently. If a 70-minute piece of audio data of about 128 MB is sliced into 958 segments, processing them at a concurrency of 64 takes about 9 seconds in total.
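The preset concurrency number can be enforced with a counting semaphore, as sketched below; the `recognize` body is a placeholder for the real ASR call, and the 958-segment arithmetic mirrors the example in the text.

```python
import asyncio
import math

MAX_CONCURRENCY = 64   # e.g. two 32-core machines, one slot per core

async def recognize(segment, sem):
    async with sem:                  # at most MAX_CONCURRENCY in flight
        await asyncio.sleep(0)       # placeholder for the real ASR call
        return len(segment)

async def run(segments):
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    return await asyncio.gather(*(recognize(s, sem) for s in segments))

# Rough capacity estimate: 958 segments at 64-way concurrency drain in
# ceil(958 / 64) = 15 waves of requests.
waves = math.ceil(958 / 64)
lengths = asyncio.run(run(["ab", "c"]))
```

Sizing the semaphore to the machine's resources keeps utilization high without oversubscribing the recognizers, which is the flexibility the text refers to.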
FIG. 3 is a flow chart of a method of speech recognition according to another embodiment of the present application. As shown in fig. 3, the method of speech recognition may include:
S31, the segmentation microservice segments the audio data to obtain a set comprising a plurality of audio segments;
S32, the segmentation microservice generates a uniform resource locator for the set and sends it to the speech recognition microservice in a hypertext transfer protocol request, requesting processing of the set;
and S33, according to the request, the speech recognition microservice uses IO multiplexing to perform speech recognition on the plurality of audio segments in the set in parallel, with a concurrency number preset according to the computing resources.
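Step S32 above can be sketched as follows. The endpoint path and JSON payload shape are illustrative assumptions (the patent does not specify a wire format), and the request is built but not sent.

```python
import json
import urllib.request

def build_recognition_request(set_id, base_url="http://asr-service.local"):
    """Build (but do not send) the HTTP request the segmentation
    microservice would issue to the speech recognition microservice.
    Both the `/sets/...` URL scheme and the `/recognize` endpoint are
    hypothetical stand-ins for the services named in the text."""
    set_url = f"{base_url}/sets/{set_id}"          # URL generated for the set
    payload = json.dumps({"set_url": set_url}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/recognize",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_recognition_request("audio-123")
```

Passing a locator rather than the audio bytes keeps the request lightweight; the recognition microservice fetches the set itself and fans out over its segments.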
In any of the above modes, the method further includes:
combining and outputting the processing results of the plurality of audio segments obtained by the speech recognition microservice.
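Merging can be sketched as below: because concurrent recognition completes in arbitrary order, each result is tagged with its original segment index and sorted before joining. The `(index, text)` tuple shape is an assumption for illustration.

```python
def merge_results(indexed_transcripts):
    """Combine per-segment ASR results into one transcript.

    Each result is (segment_index, text); completion order under
    concurrency is arbitrary, so sort by the original segment index
    before joining to restore the spoken order.
    """
    ordered = sorted(indexed_transcripts, key=lambda pair: pair[0])
    return "".join(text for _, text in ordered)

merged = merge_results([
    (2, "I love Beijing"),
    (0, "Hello, "),
    (1, "my name is Xiaoming. "),
])
```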
In the present application, segmentation and speech recognition are each encapsulated as a microservice, and the upstream segmentation microservice calls the downstream speech recognition microservice concurrently, so that the audio data can be processed at maximum capacity. A microservice architecture is a variant of the service-oriented architecture (SOA) style that builds an application as a set of loosely coupled services; in a microservice architecture, services are fine-grained and the protocols are lightweight.
Fig. 4 is a schematic diagram of the architecture of a speech recognition method according to an embodiment of the present application. As shown in fig. 4, audio data may be downloaded through an interface by a download service and stored in the in-memory store Redis. The segmentation microservice then segments the audio data by VAD into a set comprising a plurality of audio segments and initiates requests concurrently, and the speech recognition microservice processes the plurality of audio segments in the set in parallel by ASR. After all segments have been processed concurrently, the processing results are combined and output.
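The flow of fig. 4 can be composed end to end as in this sketch; `download`, `segment`, and `recognize` are hypothetical stand-ins for the download service, the segmentation microservice, and the ASR microservice (the Redis store is elided), and fixed-size chunking stands in for real VAD.

```python
import asyncio

def download(url):
    # Stand-in for the download service; real code would fetch the file
    # and place it in Redis for the downstream microservices.
    return b"\x00" * 64

def segment(audio):
    # Stand-in for the VAD-based segmentation microservice: fixed-size
    # chunks here instead of detecting speech pauses.
    return [audio[i:i + 16] for i in range(0, len(audio), 16)]

async def recognize(idx, chunk):
    await asyncio.sleep(0)            # placeholder for the ASR round-trip
    return idx, f"sentence-{idx}"

async def pipeline(url):
    chunks = segment(download(url))
    results = await asyncio.gather(
        *(recognize(i, c) for i, c in enumerate(chunks))
    )
    # Merge in original order and output one transcript.
    return " ".join(text for _, text in sorted(results))

transcript = asyncio.run(pipeline("http://example.invalid/audio.wav"))
```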
Fig. 5 is a block diagram of a speech recognition apparatus according to an embodiment of the present application. As shown in fig. 5, the speech recognition apparatus 500 may include:
a segmentation microservice module 501, configured to segment audio data to obtain a set comprising a plurality of audio segments, and to initiate a request to process the set in a concurrent manner;
and a speech recognition microservice module 502, configured to perform speech recognition on the plurality of audio segments in the set in parallel according to the request of the segmentation microservice module 501.
According to another embodiment of the present invention, when the segmentation microservice module 501 initiates a request for processing the set in a concurrent manner, the segmentation microservice module is specifically configured to:
generate a uniform resource locator for the set and send it to the speech recognition microservice module 502 in a hypertext transfer protocol request, requesting processing of the set.
According to another embodiment of the present invention, the speech recognition micro-service module 502 is specifically configured to:
according to the request, the IO multiplexing technology is adopted to perform speech recognition on the plurality of audio segments in the set in parallel.
According to another embodiment of the present invention, the speech recognition micro-service module 502 is specifically configured to:
and according to the request, performing voice recognition on the plurality of audio segments in the set in parallel according to the concurrency number preset according to the computing resources.
According to another embodiment of the present application, when segmenting the audio data to obtain the set comprising the plurality of audio segments, the segmentation microservice module 501 is specifically configured to:
segment the audio data using voice activity detection, with speech pauses as segmentation marks, to obtain a set comprising a plurality of audio segments, wherein each audio segment corresponds to one sentence.
Fig. 6 is a block diagram of a speech recognition apparatus according to another embodiment of the present application. As shown in fig. 6, the speech recognition apparatus 600 includes:
a segmentation microservice module 601, configured to segment audio data to obtain a set comprising a plurality of audio segments, and to initiate a request to process the set in a concurrent manner;
a speech recognition microservice module 602, configured to perform speech recognition on the plurality of audio segments in the set in parallel according to the request of the segmentation microservice module 601;
and an output module 603, configured to combine and output the processing results of the plurality of audio segments obtained by the speech recognition microservice module 602.
The functions of each module in each apparatus in the embodiments of the present invention may refer to the corresponding description in the above method, and are not described herein again.
The apparatus processes the audio data concurrently after segmenting it, and the time spent on segmentation is negligible. Compared with processing the audio data directly, the apparatus makes maximum use of computing resources and greatly increases the processing speed: audio that takes more than an hour to process directly can be completed within about ten seconds, more than twice the processing speed of the prior art. The apparatus also scales well, can quickly and effectively process larger audio files, and raises the maximum size of a single audio file that can be processed.
Fig. 7 is a block diagram of a speech recognition device according to an embodiment of the present application. As shown in fig. 7, the device for speech recognition includes a memory 710 and a processor 720, the memory 710 storing a computer program executable on the processor 720. When executing the computer program, the processor 720 implements the speech recognition method of the above embodiments. There may be one or more memories 710 and processors 720.
The apparatus for speech recognition further comprises:
and a communication interface 730, configured to communicate with an external device, and perform data interactive transmission.
If the memory 710, the processor 720 and the communication interface 730 are implemented independently, the memory 710, the processor 720 and the communication interface 730 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 7, but this is not intended to represent only one bus or type of bus.
Optionally, in an implementation, if the memory 710, the processor 720 and the communication interface 730 are integrated on a chip, the memory 710, the processor 720 and the communication interface 730 may complete communication with each other through an internal interface.
Embodiments of the present invention provide a computer-readable storage medium, which stores a computer program, and when the program is executed by a processor, the computer program implements the method provided in the embodiments of the present application.
The embodiment of the present application further provides a chip, the chip including a processor configured to call and run instructions stored in a memory, so that a communication device in which the chip is installed performs the method provided in the embodiments of the present application.
An embodiment of the present application further provides a chip, including: the system comprises an input interface, an output interface, a processor and a memory, wherein the input interface, the output interface, the processor and the memory are connected through an internal connection path, the processor is used for executing codes in the memory, and when the codes are executed, the processor is used for executing the method provided by the embodiment of the application.
It should be understood that the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor. It is noted that the processor may also be a processor supporting the advanced RISC machine (ARM) architecture.
Further, optionally, the memory may include a read-only memory and a random access memory, and may further include a nonvolatile random access memory. The memory may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may include a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable EPROM (EEPROM), or a flash memory. Volatile memory can include Random Access Memory (RAM), which acts as external cache memory. By way of example, and not limitation, many forms of RAM are available. For example, Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), Enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct memory bus RAM (DR RAM).
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the present application are generated in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. And the scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. All or part of the steps of the above method embodiments may be completed by hardware directed by a program; the program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present application, and these should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A method of speech recognition, comprising:
the segmentation microservice segments audio data to obtain a set comprising a plurality of audio segments;
the segmentation microservice initiates a request to a speech recognition microservice in a concurrent manner to process the set;
and the speech recognition microservice performs speech recognition on the plurality of audio segments in the set in parallel according to the request.
2. The method of claim 1, wherein the segmentation microservice initiating a request to a speech recognition microservice in a concurrent manner to process the set comprises:
the segmentation microservice generating a uniform resource locator for the set and sending it to the speech recognition microservice in a hypertext transfer protocol request, requesting processing of the set.
3. The method of claim 1, wherein the speech recognition microservice performing speech recognition on the plurality of audio segments in the set in parallel according to the request comprises:
the speech recognition microservice using IO multiplexing to perform speech recognition on the plurality of audio segments in the set in parallel according to the request.
4. The method of claim 1, wherein the speech recognition microservice performing speech recognition on the plurality of audio segments in the set in parallel according to the request comprises:
the speech recognition microservice performing speech recognition on the plurality of audio segments in the set in parallel according to the request and a concurrency number preset according to computing resources.
5. The method of claim 1, wherein the segmentation microservice segmenting audio data to obtain a set comprising a plurality of audio segments comprises:
the segmentation microservice segmenting the audio data using voice activity detection, with speech pauses as segmentation marks, to obtain a set comprising a plurality of audio segments, wherein each audio segment corresponds to one sentence.
6. The method of claim 1, further comprising:
combining and outputting the processing results of the plurality of audio segments obtained by the speech recognition microservice.
7. An apparatus for speech recognition, comprising:
the segmentation microserver module is used for segmenting the audio data to obtain a set comprising a plurality of audio fragments and initiating a request for processing the set in a concurrent mode;
and the voice recognition micro-service module is used for performing voice recognition on the plurality of audio clips in the set in parallel according to the request of the segmentation micro-service module.
8. The apparatus of claim 7, wherein the split microservice module, when initiating the request to process the set in a concurrent manner, is specifically configured to:
and generating a uniform resource locator for the set, sending the uniform resource locator to the voice recognition micro-service module in a mode of hypertext transfer protocol request, and requesting to process the set.
9. The apparatus of claim 7, wherein the speech recognition microservice module is specifically configured to:
and according to the request, performing voice recognition on the plurality of audio segments in the set in parallel by adopting an IO multiplexing technology.
10. The apparatus of claim 7, wherein the speech recognition microservice module is specifically configured to:
perform, according to the request, speech recognition on the plurality of audio segments in the set in parallel at a concurrency number preset according to the computing resources.
11. The apparatus of claim 7, wherein the segmentation microservice module, when segmenting the audio data to obtain the set comprising the plurality of audio segments, is specifically configured to:
segment the audio data using voice activity detection, with speech pauses as segmentation marks, to obtain a set comprising a plurality of audio segments, wherein each audio segment corresponds to one sentence.
12. The apparatus of claim 7, further comprising:
an output module, configured to merge and output the processing results of the plurality of audio segments obtained by the speech recognition microservice module.
13. A device for speech recognition, comprising: a processor and a memory, wherein the memory stores instructions that are loaded and executed by the processor to implement the method of any one of claims 1 to 6.
14. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, carries out the method of any one of claims 1 to 6.
CN202110397846.3A 2021-04-14 2021-04-14 Method, device, equipment and storage medium for voice recognition Pending CN112802457A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110397846.3A CN112802457A (en) 2021-04-14 2021-04-14 Method, device, equipment and storage medium for voice recognition


Publications (1)

Publication Number Publication Date
CN112802457A true CN112802457A (en) 2021-05-14

Family

ID=75811337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110397846.3A Pending CN112802457A (en) 2021-04-14 2021-04-14 Method, device, equipment and storage medium for voice recognition

Country Status (1)

Country Link
CN (1) CN112802457A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113707152A (en) * 2021-08-27 2021-11-26 上海哔哩哔哩科技有限公司 Voice recognition method and device

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101554028A (en) * 2006-02-22 2009-10-07 法国电信公司 Interface for mobile devices and methods
CN104538033A (en) * 2014-12-29 2015-04-22 江苏科技大学 Parallelized voice recognizing system based on embedded GPU system and method
CN106888129A (en) * 2017-04-20 2017-06-23 国家电网公司 It is a kind of can elastic telescopic distributed service management system and its method
CN107767873A (en) * 2017-10-20 2018-03-06 广东电网有限责任公司惠州供电局 A kind of fast and accurately offline speech recognition equipment and method
CN109286653A (en) * 2017-07-21 2019-01-29 埃森哲环球解决方案有限公司 Intelligent cloud engineering platform
CN110011846A (en) * 2019-03-29 2019-07-12 努比亚技术有限公司 Micro services construction method, device, mobile terminal and storage medium
CN110096662A (en) * 2019-04-22 2019-08-06 华为技术有限公司 Webpage display method and related device
CN111124671A (en) * 2019-12-10 2020-05-08 广州小鹏汽车科技有限公司 Batch inference dynamic waiting method, server, and computer-readable storage medium
CN111279309A (en) * 2017-09-30 2020-06-12 甲骨文国际公司 Container deployment based on environmental requirements
CN111432989A (en) * 2018-12-04 2020-07-17 深圳前海达闼云端智能科技有限公司 Artificially enhanced cloud-based robot intelligence framework and related methods
CN112185352A (en) * 2020-08-31 2021-01-05 华为技术有限公司 Voice recognition method and device and electronic equipment


Similar Documents

Publication Publication Date Title
JP6914236B2 (en) Speech recognition methods, devices, devices, computer-readable storage media and programs
CN107578776B (en) Voice interaction awakening method and device and computer readable storage medium
CN106504743B (en) Voice interaction output method for intelligent robot and robot
CN110675861B (en) Method, device and equipment for speech sentence interruption and storage medium
CN107943834B (en) Method, device, equipment and storage medium for implementing man-machine conversation
CN111341315B (en) Voice control method, device, computer equipment and storage medium
CN111916082A (en) Voice interaction method and device, computer equipment and storage medium
CN103514882A (en) Voice identification method and system
CN112750461B (en) Voice communication optimization method and device, electronic equipment and readable storage medium
CN112802457A (en) Method, device, equipment and storage medium for voice recognition
US11488603B2 (en) Method and apparatus for processing speech
US20120053937A1 (en) Generalizing text content summary from speech content
CN111063356B (en) Electronic equipment response method and system, sound box and computer readable storage medium
CN112562654B (en) Audio classification method and computing device
CN111312243B (en) Equipment interaction method and device
CN110223694B (en) Voice processing method, system and device
CN113096692A (en) Voice detection method and device, equipment and storage medium
CN110874343B (en) Method for processing voice based on deep learning chip and deep learning chip
CN110491366B (en) Audio smoothing method and device, computer equipment and storage medium
CN113012680B (en) Speech technology synthesis method and device for speech robot
CN112306560B (en) Method and apparatus for waking up an electronic device
CN113963680A (en) Audio playing method, device and equipment
CN113782050A (en) Sound tone changing method, electronic device and storage medium
CN111566727B (en) Multi-phase response in full duplex voice conversations
US11211075B2 (en) Service control method, service control apparatus and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210514