CN116612743A - Speech recognition model evaluation method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116612743A
CN116612743A
Authority
CN
China
Prior art keywords
recognition
voice
recognition result
text
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310736927.0A
Other languages
Chinese (zh)
Inventor
王伟戌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yunsizhixue Technology Co ltd
Original Assignee
Beijing Yunsizhixue Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yunsizhixue Technology Co ltd filed Critical Beijing Yunsizhixue Technology Co ltd

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/01 Assessment or evaluation of speech recognition systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The disclosure provides a speech recognition model evaluation method and device, an electronic device, and a storage medium. A speech data set to be annotated is divided into a preset number of data blocks; the preset number of data blocks are respectively input into a corresponding number of speech recognition models to obtain a recognition result set corresponding to each speech recognition model; one recognition result is determined from each of the different recognition result sets as a reference recognition text; after the reference recognition text is annotated, each speech recognition model is evaluated in turn based on the annotated reference recognition text. Compared with the related art, determining one recognition result from each recognition result set as the reference recognition text allows every speech recognition model to be evaluated against the same annotation result. Cross-generating the reference recognition text reduces the influence of the annotation quality of the reference recognition text on the measured recognition accuracy of the speech recognition models, enabling an objective evaluation of the speech recognition models.

Description

Speech recognition model evaluation method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of data processing, and in particular relates to a method and a device for evaluating a voice recognition model, electronic equipment and a storage medium.
Background
As speech recognition technology advances, the number of available speech recognition models grows larger and larger, and higher requirements are placed on their recognition accuracy. The annotation of the speech data affects the accuracy of a speech recognition model: speech is annotated against a reference recognition text, the annotation process is affected by various factors, and a poor annotation degrades the measured recognition accuracy of the speech recognition model. Because the quality of the speech annotation directly influences the recognition accuracy of the speech recognition model, if different reference recognition texts are used for training different speech recognition models, the relative quality of the models cannot be judged directly when the speech recognition models are evaluated.
Disclosure of Invention
The disclosure provides a speech recognition model evaluation method and device, an electronic device, and a storage medium, whose main purpose is to enable the evaluation of different speech recognition models.
According to a first aspect of the present disclosure, there is provided a method for evaluating a speech recognition model, including:
dividing a speech data set to be annotated into a preset number of data blocks;
inputting the preset number of data blocks respectively into a corresponding number of speech recognition models to obtain a recognition result set corresponding to each speech recognition model; wherein each data block corresponds to one recognition result;
determining one recognition result from each of the recognition result sets as a reference recognition text; and
after the reference recognition text is annotated, evaluating each speech recognition model in turn based on the annotated reference recognition text.
Optionally, after the labeling processing is performed on the reference recognition text, evaluating each voice recognition model sequentially based on the labeled reference recognition text includes:
respectively selecting one target recognition result corresponding to the data block in each recognition result set to generate different target recognition result combinations with preset quantity;
calculating a first recognition error rate of the target recognition result combinations with the preset number based on the marked reference recognition text;
each of the speech recognition models is evaluated based on the first recognition error rate.
Optionally, the dividing the voice data set to be marked into the data blocks with the preset number includes:
acquiring the number of voice data to be marked in the voice data set to be marked;
and equally dividing the voice data set to be marked into data blocks with the preset number based on the number of the voice data to be marked.
Optionally, the step of inputting the preset number of data blocks into a corresponding number of voice recognition models respectively, and the step of obtaining a recognition result set corresponding to each voice recognition model respectively includes:
respectively carrying out recognition processing on each data block by using each voice recognition model to obtain a recognition result of each voice recognition model on each data block;
and combining, for any one speech recognition model, its recognition results on the data blocks into a recognition result set.
Optionally, the determining one recognition result from the different recognition result sets as the reference recognition text includes:
and respectively selecting one recognition result corresponding to the data blocks from different recognition result sets as a reference recognition text, wherein all the selected data blocks form the voice data set to be marked.
Optionally, after labeling the reference identification text, the method further includes:
calculating a second recognition error rate of each recognition result set under different batches;
generating annotation fluctuation of annotation processing under different batches based on the second recognition error rate and the annotation result;
and evaluating the labeling result based on the labeling fluctuation.
According to a second aspect of the present disclosure, there is provided an evaluation apparatus of a speech recognition model, comprising:
the dividing unit is used for dividing the voice data set to be marked into data blocks with preset quantity;
the processing unit is used for inputting the preset number of data blocks into a corresponding number of voice recognition models respectively to obtain a recognition result set corresponding to each voice recognition model respectively; wherein each data block corresponds to one identification result;
the first generation unit is used for respectively determining one recognition result from the recognition result sets as a reference recognition text;
and the first evaluation unit is used for evaluating each voice recognition model in turn based on the labeled reference recognition text after labeling the reference recognition text.
Optionally, the first evaluation unit includes:
the first generation module is used for respectively selecting a target recognition result corresponding to the data block in each recognition result set to generate different target recognition result combinations with preset quantity;
the calculation module is used for calculating the first recognition error rate of the target recognition result combinations with the preset number based on the marked reference recognition text;
and the evaluation module is used for evaluating each speech recognition model based on the first recognition error rate.
Optionally, the dividing unit includes:
the acquisition module is used for acquiring the quantity of the voice data to be marked in the voice data set to be marked;
the dividing module is used for equally dividing the voice data set to be marked into data blocks with the preset number based on the number of the voice data to be marked.
Optionally, the processing unit includes:
the processing module is used for respectively carrying out recognition processing on each data block by utilizing each voice recognition model to obtain a recognition result of each voice recognition model on each data block;
and the second generation module is used for combining, for any one speech recognition model, its recognition results on the data blocks into a recognition result set.
Optionally, the first generating unit is further configured to:
and respectively selecting one recognition result corresponding to the data blocks from different recognition result sets as a reference recognition text, wherein all the selected data blocks form the voice data set to be marked.
Optionally, the apparatus further includes:
the calculating unit is used for calculating the second recognition error rate of each recognition result set in different batches after the reference recognition text is marked;
the second generating unit is used for generating marking fluctuation of marking processing under different batches based on the second recognition error rate and the marking result;
and the second evaluation unit is used for evaluating the labeling result based on the labeling fluctuation.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of the preceding first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect described above.
The disclosure provides a speech recognition model evaluation method and device, an electronic device, and a storage medium. A speech data set to be annotated is divided into a preset number of data blocks; the preset number of data blocks are respectively input into a corresponding number of speech recognition models to obtain a recognition result set corresponding to each speech recognition model, wherein each data block corresponds to one recognition result; one recognition result is determined from each of the recognition result sets as a reference recognition text; and after the reference recognition text is annotated, each speech recognition model is evaluated in turn based on the annotated reference recognition text. Compared with the related art, one recognition result is determined from each recognition result set as the reference recognition text, the reference recognition text is annotated, and every speech recognition model is evaluated against that same annotation result. Cross-generating the reference recognition text reduces the influence of the annotation quality of the reference recognition text on the measured recognition accuracy of the speech recognition models, enabling an objective evaluation.
It should be understood that the content described in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to limit the scope of the disclosure. Other features of the disclosure will become apparent from the following description.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a flowchart of a method for evaluating a speech recognition model according to an embodiment of the present disclosure;
FIG. 2 is a flowchart illustrating another method for evaluating a speech recognition model according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a method for evaluating annotation processing according to an embodiment of the disclosure;
fig. 4 is a schematic structural diagram of an evaluation device for a speech recognition model according to an embodiment of the disclosure;
FIG. 5 is a schematic structural diagram of another speech recognition model evaluation apparatus according to an embodiment of the present disclosure;
fig. 6 is a schematic block diagram of an example electronic device provided by an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The following describes a method and apparatus for evaluating a speech recognition model, an electronic device, and a storage medium according to embodiments of the present disclosure with reference to the accompanying drawings.
Fig. 1 is a flowchart of a method for evaluating a speech recognition model according to an embodiment of the disclosure.
As shown in fig. 1, the method comprises the steps of:
and step 101, dividing the voice data set to be marked into a preset number of data blocks.
In the embodiment of the present disclosure, the speech data set to be annotated is a set of speech data that needs annotation. The number of data blocks into which it is divided is determined by the number of speech recognition models to be evaluated. For example, if model A and model B need to be evaluated, the speech data set to be annotated is divided into two parts, part1 and part2. It should be noted that the embodiments of the present disclosure are described with a preset number of two, which does not limit the disclosure; part1 and part2 may be divided equally or in other proportions, which the embodiments of the present disclosure likewise do not limit.
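As a concrete illustration, the division step can be sketched as follows; `split_into_blocks` and the utterance IDs are illustrative names for this sketch, not identifiers from the patent:

```python
def split_into_blocks(utterances, num_models):
    """Divide the utterance list into num_models roughly equal, contiguous blocks
    (one block per speech recognition model under evaluation)."""
    size, rem = divmod(len(utterances), num_models)
    blocks, start = [], 0
    for i in range(num_models):
        end = start + size + (1 if i < rem else 0)  # spread any remainder over the first blocks
        blocks.append(utterances[start:end])
        start = end
    return blocks

# Two models (A and B) -> two blocks, part1 and part2.
part1, part2 = split_into_blocks([f"utt_{i}" for i in range(10)], 2)
```

Unequal proportions, which the embodiment also allows, would only change how `end` is computed.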
Step 102, respectively inputting the preset number of data blocks into a corresponding number of speech recognition models to obtain a recognition result set corresponding to each speech recognition model; wherein each data block corresponds to one recognition result.
In the embodiment of the disclosure, the recognition result set is a set of recognition results obtained by processing the data blocks in the voice data set to be marked by the voice recognition model. Inputting a voice data set to be marked (a preset number of data blocks) into a voice recognition model to obtain a recognition result of each data block; and the voice recognition models respectively process the preset number of data blocks to obtain a plurality of recognition result sets.
For example, assuming that the speech data set to be annotated is divided into two parts, part1 and part2, inputting part1 and part2 into model A (speech recognition model A) yields the recognition result set {A1, A2}; inputting them into model B likewise yields {B1, B2}.
And step 103, respectively determining one recognition result from the recognition result sets as a reference recognition text.
In the embodiment of the disclosure, since the quality of the voice label directly affects the recognition accuracy of the voice recognition model, if different reference recognition texts are used for training different voice recognition models, the quality of the model cannot be directly judged when the voice recognition models are evaluated. Therefore, the corresponding reference recognition text can be generated by processing the corresponding number of recognition result sets obtained by processing the voice data set to be annotated (the preset number of data blocks) by using each voice recognition model. The specific generation process is that one recognition result corresponding to the data block is selected from each recognition result set, and the result after the data blocks corresponding to the selected recognition result are combined is a complete voice data set to be marked.
And 104, after the reference recognition text is marked, evaluating each voice recognition model in turn based on the marked reference recognition text.
In an embodiment of the present disclosure, after labeling the reference recognition text, each speech recognition model is evaluated using the labeled reference recognition text. Since the reference recognition text is generated according to the recognition result of each voice recognition model, the recognition result of each voice recognition model to one data block in the voice data set to be annotated can be obtained. And evaluating the recognition result of each voice recognition model according to the reference recognition text, thereby realizing the evaluation of the voice recognition model.
The present disclosure provides a speech recognition model evaluation method that divides a speech data set to be annotated into a preset number of data blocks; inputs the preset number of data blocks respectively into a corresponding number of speech recognition models to obtain a recognition result set corresponding to each speech recognition model, wherein each data block corresponds to one recognition result; determines one recognition result from each of the recognition result sets as a reference recognition text; and, after the reference recognition text is annotated, evaluates each speech recognition model in turn based on the annotated reference recognition text. Compared with the related art, one recognition result is determined from each recognition result set as the reference recognition text, the reference recognition text is annotated, and every speech recognition model is evaluated against the same annotation result. Cross-generating the reference recognition text reduces the influence of the annotation quality of the reference recognition text on the measured recognition accuracy of the speech recognition models, thereby enabling an objective evaluation.
In order to clearly illustrate the embodiments of the present disclosure, the embodiments of the present disclosure provide a flow diagram of another method for evaluating a speech recognition model.
As shown in fig. 2, the method comprises the steps of:
optionally, after the labeling processing is performed on the reference recognition text, evaluating each voice recognition model sequentially based on the labeled reference recognition text includes:
step 201, obtaining the number of the voice data to be marked in the voice data set to be marked.
Step 202, equally dividing the voice data set to be marked into data blocks with the preset number based on the number of voice data to be marked.
In particular, in the embodiment of the present disclosure, the to-be-annotated voice data set includes a plurality of pieces of to-be-annotated voice data, and after the number of to-be-annotated voice data is obtained, the to-be-annotated voice data set is equally divided into data blocks with a corresponding number according to the number of voice recognition models to be evaluated.
For example, assume that the speech data set to be annotated contains 5000 pieces of speech data to be annotated and there are two speech recognition models (model A and model B); the speech data set is then equally divided into two data blocks, part1 and part2.
And 203, respectively carrying out recognition processing on each data block by using each voice recognition model to obtain a recognition result of each voice recognition model on each data block.
In particular, in the embodiment of the present disclosure, a to-be-labeled voice data set (a preset number of data blocks) is respectively input into a preset number of voice recognition models, so as to obtain a corresponding number of recognition result sets.
For example, the to-be-annotated speech data sets (part 1 and part 2) are respectively input into a model a and a model B, so as to obtain a recognition result A1 of the model a on the part1, a recognition result A2 of the model a on the part2, a recognition result B1 of the model B on the part1 and a recognition result B2 of the model B on the part 2.
Step 204, for each speech recognition model, combining its recognition results on the data blocks into a recognition result set.
In particular, in the embodiment of the present disclosure, the recognition results of any one speech recognition model for each data block are synthesized into a recognition result set.
Illustratively, the recognition result A1 of the part1 by the model a and the recognition result A2 of the part2 by the model a are combined into the recognition result set { A1, A2}, and the recognition result B1 of the part1 by the model B and the recognition result B2 of the part2 by the model B are combined into the recognition result set { B1, B2}.
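Steps 203 and 204 amount to running every model over every block and grouping the results per model. A minimal sketch, assuming models are plain callables; the names and dict layout are illustrative, not the patent's API:

```python
def build_result_sets(models, blocks):
    """models: {model_name: callable(audio) -> text}; blocks: {block_id: audio}.
    Returns {model_name: {block_id: recognition_result}} -- one result set per model."""
    return {name: {blk_id: model(audio) for blk_id, audio in blocks.items()}
            for name, model in models.items()}

# Toy stand-ins for real ASR models; a real system would call an actual recognizer.
models = {"A": lambda audio: "A:" + audio, "B": lambda audio: "B:" + audio}
blocks = {"part1": "p1_audio", "part2": "p2_audio"}
result_sets = build_result_sets(models, blocks)
```

Here `result_sets["A"]` plays the role of {A1, A2} and `result_sets["B"]` the role of {B1, B2} in the example above.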
And 205, respectively selecting one recognition result corresponding to the data blocks from the recognition result sets as a reference recognition text, wherein all the selected data blocks form the voice data set to be marked.
In particular, in the embodiment of the present disclosure, an identification result corresponding to one data block is selected from each identification result set, and a result obtained by combining the selected identification results corresponding to the data blocks is a complete voice data set to be marked.
Illustratively, A1 and B2 are selected from the recognition result sets {A1, A2} and {B1, B2} as the reference recognition text, or A2 and B1 are selected as the reference recognition text. It should be noted that the data blocks corresponding to the selected recognition results together make up exactly the complete speech data set to be annotated.
Step 206, selecting a target recognition result corresponding to the data block in each recognition result set respectively, and generating different target recognition result combinations with preset quantity.
In particular, in the embodiment of the present disclosure, a target recognition result combination selects the recognition result corresponding to one data block from each recognition result set, such that the data blocks corresponding to the selected recognition results together make up the complete speech data set to be annotated.
Illustratively, A1 and B2 are selected from the recognition result sets {A1, A2} and {B1, B2} as one target recognition result combination, and A2 and B1 as another. When more speech recognition models are evaluated, say models A, B and C, a corresponding number of target recognition result combinations, such as {A1, B2, C3}, {A2, B3, C1} and {A3, B1, C2}, is generated from the recognition result set {A1, A2, A3} of model A, the recognition result set {B1, B2, B3} of model B and the recognition result set {C1, C2, C3} of model C; the reference recognition text may be any one of the target recognition result combinations.
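One simple way to realize a preset number of different target recognition result combinations is a cyclic rotation of block assignments, so that each model contributes a different block in each combination. This is a hedged sketch under an assumed dict layout, not the patent's prescribed algorithm:

```python
def target_combinations(result_sets):
    """result_sets: {model: {block: result}}. Rotate block assignments so each
    model contributes a distinct block per combination; the blocks chosen in each
    combination jointly cover the whole dataset. Yields n combinations for n models."""
    models = sorted(result_sets)
    blocks = sorted(next(iter(result_sets.values())))
    n = len(models)
    combos = []
    for shift in range(n):
        combo = {blocks[(i + shift) % n]: result_sets[m][blocks[(i + shift) % n]]
                 for i, m in enumerate(models)}
        combos.append(combo)
    return combos

sets_ = {"A": {"part1": "A1", "part2": "A2"},
         "B": {"part1": "B1", "part2": "B2"}}
combos = target_combinations(sets_)  # the combinations {A1, B2} and {A2, B1}
```

With three models the same rotation yields three combinations of the {A1, B2, C3} / {A2, B3, C1} / {A3, B1, C2} form.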
Step 207, calculating a first recognition error rate of the target recognition result combination of the preset number based on the annotated reference recognition text.
Step 208, evaluating each of the speech recognition models based on the first error rate.
In particular, in the disclosed embodiments, the first recognition error rate may be represented using, but not limited to, a word error rate. The evaluation of different speech recognition models is further realized by calculating word error rates of different target recognition result combinations.
Illustratively, model A and model B are evaluated, and { A1, B2} are used as reference recognition texts to calculate first recognition error rates of { A1, B2} and { A2, B1} respectively.
With {A1, B2} as the reference recognition text, the speech recognition models can be evaluated so that the strengths and weaknesses of different models are judged objectively. When the word error rate is calculated, the part1 data block is annotated from recognition result A1, which favors model A over model B on that block in terms of recognition accuracy; conversely, the part2 data block is annotated from recognition result B2, which favors model B over model A. These two biases affect model A and model B equally in the combination {A2, B1}, so comparing the first recognition error rates of {A1, B2} and {A2, B1} enables the evaluation of model A and model B.
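When the first recognition error rate is instantiated as word error rate, it can be computed with a standard dynamic-programming edit distance; the following sketch is conventional WER code, not code from the patent:

```python
def wer(reference_words, hypothesis_words):
    """Word error rate: (substitutions + deletions + insertions) / reference length,
    via Levenshtein edit distance between the two token sequences."""
    r, h = reference_words, hypothesis_words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # delete all of r[:i]
    for j in range(len(h) + 1):
        dp[0][j] = j  # insert all of h[:j]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(r)][len(h)] / len(r)

rate = wer("the quick brown fox".split(), "the quick red fox".split())  # 1 substitution / 4 words
```

For a combination such as {A1, B2}, the rate would be computed over the concatenated data blocks against the annotated reference recognition text.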
The labeling effects of labeling processes of different batches have a certain difference, so after labeling processes are performed on the reference identification text, the disclosure further provides a method for evaluating the labeling process, and fig. 3 is a flow chart of a method for evaluating the labeling process provided by an embodiment of the disclosure.
As shown in fig. 3, the method comprises the steps of:
step 301, calculating a second recognition error rate of each recognition result set under different batches.
And step 302, generating annotation fluctuation of annotation processing under different batches based on the second recognition error rate and the annotation result.
And step 303, evaluating the labeling result based on the labeling fluctuation.
In particular, in the embodiments of the present disclosure, the annotation fluctuation is the magnitude of the impact of the reference recognition text on the annotation, and may be measured, for example, but not limited to, according to the difference in recognition error rates of different batches. When the labeling process is performed, the labeling effect of the labeling process of different batches is different, so that the labeling result of each batch can be evaluated by calculating the labeling fluctuation.
Illustratively, annotation fluctuations are generated by calculating a second recognition error rate for the recognition result sets { A1, A2} and { B1, B2}. And evaluating the quality of the labeling results of the labeling processes of different batches according to the generated labeling fluctuation.
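The patent does not fix a formula for annotation fluctuation; one plausible realization, shown here purely as an assumption, is the spread (maximum minus minimum) of the per-batch second recognition error rates:

```python
def annotation_fluctuation(batch_error_rates):
    """Spread of the second recognition error rate across annotation batches;
    a large spread suggests inconsistent labeling quality between batches."""
    rates = list(batch_error_rates.values())
    return max(rates) - min(rates)

# Hypothetical per-batch error rates for one recognition result set.
fluct = annotation_fluctuation({"batch1": 0.12, "batch2": 0.15, "batch3": 0.11})
```

Other dispersion statistics (e.g. standard deviation of the per-batch rates) would serve the same purpose.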
It should be noted that the embodiments of the present disclosure may include multiple steps, which are numbered for convenience of description; these numbers do not constrain the timing or order of execution, and the steps may be performed in any order, which the embodiments of the present disclosure do not limit.
Corresponding to the evaluation method based on the voice recognition model, the application also provides an evaluation device of the voice recognition model. Since the device embodiment of the present application corresponds to the above-mentioned method embodiment, details not disclosed in the device embodiment may refer to the above-mentioned method embodiment, and details are not described in detail in the present application.
Fig. 4 is a schematic structural diagram of an evaluation device for a speech recognition model according to an embodiment of the present disclosure, as shown in fig. 4, including:
a dividing unit 41, configured to divide a voice data set to be annotated into a preset number of data blocks;
the processing unit 42 is configured to input the preset number of data blocks into a corresponding number of speech recognition models respectively, to obtain a recognition result set corresponding to each speech recognition model; wherein each data block corresponds to one recognition result;
a first generating unit 43 for determining one recognition result from the different recognition result sets as a reference recognition text, respectively;
the first evaluation unit 44 is configured to, after labeling the reference recognition text, evaluate each of the speech recognition models in turn based on the labeled reference recognition text.
The disclosure provides a speech recognition model evaluation device that divides a speech data set to be annotated into a preset number of data blocks; inputs the preset number of data blocks respectively into a corresponding number of speech recognition models to obtain a recognition result set corresponding to each speech recognition model, wherein each data block corresponds to one recognition result; determines one recognition result from each of the recognition result sets as a reference recognition text; and, after the reference recognition text is annotated, evaluates each speech recognition model in turn based on the annotated reference recognition text. Compared with the related art, one recognition result is determined from each recognition result set as the reference recognition text, the reference recognition text is annotated, and every speech recognition model is evaluated against the same annotation result. Cross-generating the reference recognition text reduces the influence of the annotation quality of the reference recognition text on the measured recognition accuracy of the speech recognition models, thereby enabling an objective evaluation.
Further, in one possible implementation manner of this embodiment, as shown in fig. 5, the first evaluation unit 44 includes:
a first generating module 441, configured to select a target recognition result corresponding to the data block in each recognition result set, and generate different target recognition result combinations with a preset number;
a calculating module 442, configured to calculate a first recognition error rate of the preset number of target recognition result combinations based on the annotated reference recognition text;
an evaluation module 443 for evaluating each of the speech recognition models based on the first recognition error rate.
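As an illustration of the calculating module 442, the first recognition error rate can be computed as a word error rate (WER) between a target recognition result and the labeled reference text via edit distance. This is a minimal sketch under assumptions: the disclosure does not fix a particular error-rate formula, and the hypothesis/reference strings are invented for illustration.

```python
# Hypothetical WER computation for the calculating module 442: edit distance
# between reference and hypothesis word sequences, normalized by reference
# length. The specific metric (word-level Levenshtein) is an assumption.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate between a labeled reference text and a recognition result."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# one substitution over four reference words -> 0.25
wer = word_error_rate("the quick brown fox", "the quick brown box")
```

Averaging this rate over every target recognition result combination would yield the per-combination first recognition error rate used by the evaluation module 443.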
Further, in one possible implementation manner of this embodiment, as shown in fig. 5, the dividing unit 41 includes:
an obtaining module 411, configured to obtain the number of to-be-annotated voice data in the to-be-annotated voice data set;
the dividing module 412 is configured to equally divide the voice data set to be marked into the data blocks with the preset number based on the number of voice data to be marked.
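The dividing module 412 can be sketched as splitting a list of utterances into a preset number of near-equal data blocks. The utterance IDs and the block count below are illustrative assumptions, not values fixed by the disclosure.

```python
# Hypothetical sketch of the dividing unit 41: split the voice data set to be
# annotated (here, a list of utterance IDs) into a preset number of blocks,
# with sizes differing by at most one item.

def divide_into_blocks(dataset, num_blocks):
    """Split `dataset` into `num_blocks` contiguous, near-equal blocks."""
    base, extra = divmod(len(dataset), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        size = base + (1 if i < extra else 0)  # first `extra` blocks get one more
        blocks.append(dataset[start:start + size])
        start += size
    return blocks

utterances = [f"utt_{i:03d}" for i in range(10)]  # toy data set to be annotated
blocks = divide_into_blocks(utterances, 3)        # preset number = 3
```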
Further, in one possible implementation manner of this embodiment, as shown in fig. 5, the processing unit 42 includes:
the processing module 421 is configured to perform recognition processing on each data block by using each voice recognition model, so as to obtain a recognition result of each voice recognition model on each data block;
a second generating module 422, configured to generate, for any one of the speech recognition models, a recognition result set from the recognition results of that model on each of the data blocks.
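The processing module 421 and second generating module 422 together can be sketched as running every model on every data block and collecting one recognition result set per model. Here `recognize` is a stand-in for an actual speech recognition model call, which the disclosure does not specify.

```python
# Hypothetical sketch of the processing unit 42: each model produces one
# recognition result per data block; per-model results form its result set.

def build_result_sets(models, blocks):
    """Return {model_index: {block_index: recognition_result}}."""
    result_sets = {}
    for m, recognize in enumerate(models):
        result_sets[m] = {b: recognize(block) for b, block in enumerate(blocks)}
    return result_sets

# Toy stand-in "models": trivial functions instead of real ASR systems.
models = [lambda xs: " ".join(xs).upper(), lambda xs: " ".join(xs)]
sets_ = build_result_sets(models, [["a", "b"], ["c"]])
```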
Further, in a possible implementation manner of this embodiment, the first generating unit 43 is further configured to:
and respectively selecting one recognition result corresponding to the data blocks from different recognition result sets as a reference recognition text, wherein all the selected data blocks form the voice data set to be marked.
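One way to realize the cross-wise selection described above: if `results[m][b]` holds model m's recognition result for data block b, taking block b's reference text from model b's output ensures each recognition result set contributes exactly one result and the selected blocks together cover the whole data set to be annotated. The nested-dict layout and toy strings are assumptions for illustration.

```python
# Hypothetical sketch of the first generating unit 43: one recognition result
# is selected per result set, in a cross-wise (diagonal) pattern.

def select_reference_texts(results):
    """Pick block b's reference recognition text from model b's result set."""
    num_models = len(results)
    return {b: results[b][b] for b in range(num_models)}

# results[m][b]: toy recognition result of model m on data block b
results = {
    0: {0: "hello world", 1: "good morning", 2: "see you"},
    1: {0: "hello word",  1: "good morning", 2: "see you"},
    2: {0: "hello world", 1: "good mourning", 2: "sea you"},
}
reference = select_reference_texts(results)
```

Because each block's reference text comes from a different model, no single model's errors dominate the labeled reference, which is the stated motivation for the cross-wise generation.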
Further, in a possible implementation manner of this embodiment, as shown in fig. 5, the apparatus further includes:
a calculating unit 45, configured to calculate a second recognition error rate of each of the recognition result sets in different batches after performing labeling processing on the reference recognition text;
a second generating unit 46, configured to generate labeling fluctuations of labeling processes under different batches based on the second recognition error rate and the labeling result;
and a second evaluation unit 47 for evaluating the labeling result based on the labeling fluctuation.
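A sketch of the second generating unit 46: given the second recognition error rates of a recognition result set measured over different labeling batches, the labeling fluctuation can be taken as the standard deviation of the per-batch rates. Both the standard-deviation choice and the example rates are assumptions; the disclosure does not fix a fluctuation formula.

```python
# Hypothetical fluctuation metric: population standard deviation of per-batch
# second recognition error rates. Low fluctuation suggests stable labeling.
import statistics

def annotation_fluctuation(batch_error_rates):
    """Spread of recognition error rates across labeling batches."""
    return statistics.pstdev(batch_error_rates)

# toy second recognition error rates over four labeling batches
fluct = annotation_fluctuation([0.12, 0.11, 0.13, 0.12])
```

The second evaluation unit 47 could then compare this fluctuation against a threshold to judge whether the labeling result is reliable.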
The foregoing explanation of the method embodiment also applies to the apparatus of this embodiment; the principle is the same, and details are not repeated here.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 6 shows a schematic block diagram of an example electronic device 500 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 500 includes a computing unit 501 that can perform various appropriate actions and processes according to a computer program stored in a ROM (Read-Only Memory) 502 or a computer program loaded from a storage unit 508 into a RAM (Random Access Memory) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, ROM 502, and RAM 503 are connected to each other by a bus 504. An I/O (Input/Output) interface 505 is also connected to the bus 504.
Various components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, etc.; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508 such as a magnetic disk, an optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable processor, controller, microcontroller, etc. The computing unit 501 performs the respective methods and processes described above, for example the evaluation method of the speech recognition model. For example, in some embodiments, the method of evaluating a speech recognition model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the foregoing method of evaluating the speech recognition model in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, FPGAs (Field-Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits), ASSPs (Application-Specific Standard Products), SOCs (Systems on Chip), CPLDs (Complex Programmable Logic Devices), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (Erasable Programmable Read-Only Memory) or flash memory, an optical fiber, a CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (Cathode-Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: LANs (Local Area Networks), WANs (Wide Area Networks), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak business scalability in traditional physical host and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be noted that artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves technologies at both the hardware and software levels. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning technology, big data processing technology, knowledge graph technology, and the like.
It should be appreciated that steps may be reordered, added, or deleted in the various forms of the flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

1. A method of evaluating a speech recognition model, comprising:
dividing a voice data set to be marked into data blocks with preset quantity;
respectively inputting the preset number of data blocks into a corresponding number of voice recognition models to respectively obtain a recognition result set corresponding to each voice recognition model; wherein each data block corresponds to one recognition result;
respectively determining a recognition result from the recognition result sets as a reference recognition text;
and after the reference recognition text is marked, evaluating each voice recognition model in turn based on the marked reference recognition text.
2. The method of claim 1, wherein after labeling the reference recognition text, evaluating each of the speech recognition models in turn based on the labeled reference recognition text comprises:
respectively selecting one target recognition result corresponding to the data block in each recognition result set to generate different target recognition result combinations with preset quantity;
calculating a first recognition error rate of the target recognition result combinations with the preset number based on the marked reference recognition text;
each of the speech recognition models is evaluated based on the first recognition error rate.
3. The method of claim 1, wherein dividing the voice data set to be annotated into a preset number of data blocks comprises:
acquiring the number of voice data to be marked in the voice data set to be marked;
and equally dividing the voice data set to be marked into data blocks with the preset number based on the number of the voice data to be marked.
4. The method of claim 1, wherein the step of inputting the predetermined number of data blocks into a corresponding number of speech recognition models, respectively, to obtain a recognition result set corresponding to each of the speech recognition models, respectively, includes:
respectively carrying out recognition processing on each data block by using each voice recognition model to obtain a recognition result of each voice recognition model on each data block;
generating, for any one of the voice recognition models, a recognition result set from the recognition results of that voice recognition model on each of the data blocks.
5. The method of claim 1, wherein the determining a recognition result from the respective sets of recognition results as the reference recognition text comprises:
and respectively selecting one recognition result corresponding to the data blocks from different recognition result sets as a reference recognition text, wherein all the selected data blocks form the voice data set to be marked.
6. The method of claim 1, wherein after labeling the reference identified text, the method further comprises:
calculating a second recognition error rate of each recognition result set under different batches;
generating annotation fluctuation of annotation processing under different batches based on the second recognition error rate and the annotation result;
and evaluating the labeling result based on the labeling fluctuation.
7. An evaluation device of a speech recognition model, comprising:
the dividing unit is used for dividing the voice data set to be marked into data blocks with preset quantity;
the processing unit is used for inputting the preset number of data blocks into a corresponding number of voice recognition models respectively to obtain a recognition result set corresponding to each voice recognition model respectively; wherein each data block corresponds to one identification result;
a generating unit for respectively determining one recognition result from the recognition result sets as a reference recognition text;
and the evaluation unit is used for evaluating each voice recognition model in turn based on the labeled reference recognition text after the labeling processing is carried out on the reference recognition text.
8. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
9. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-6.
CN202310736927.0A 2023-06-20 2023-06-20 Speech recognition model evaluation method and device, electronic equipment and storage medium Pending CN116612743A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310736927.0A CN116612743A (en) 2023-06-20 2023-06-20 Speech recognition model evaluation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310736927.0A CN116612743A (en) 2023-06-20 2023-06-20 Speech recognition model evaluation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116612743A true CN116612743A (en) 2023-08-18

Family

ID=87678315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310736927.0A Pending CN116612743A (en) 2023-06-20 2023-06-20 Speech recognition model evaluation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116612743A (en)

Similar Documents

Publication Publication Date Title
CN113342345A (en) Operator fusion method and device of deep learning framework
EP4064277A1 (en) Method and apparatus for training speech recognition model, device and storage medium
CN112579727B (en) Document content extraction method and device, electronic equipment and storage medium
CN113780098B (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN113657483A (en) Model training method, target detection method, device, equipment and storage medium
CN113850386A (en) Model pre-training method, device, equipment, storage medium and program product
CN112580666A (en) Image feature extraction method, training method, device, electronic equipment and medium
CN115359308B (en) Model training method, device, equipment, storage medium and program for identifying difficult cases
CN114511743A (en) Detection model training method, target detection method, device, equipment, medium and product
CN112529159A (en) Network training method and device and electronic equipment
CN114141236B (en) Language model updating method and device, electronic equipment and storage medium
CN114267375B (en) Phoneme detection method and device, training method and device, equipment and medium
CN113408304B (en) Text translation method and device, electronic equipment and storage medium
CN112541557B (en) Training method and device for generating countermeasure network and electronic equipment
CN116612743A (en) Speech recognition model evaluation method and device, electronic equipment and storage medium
CN114220163A (en) Human body posture estimation method and device, electronic equipment and storage medium
CN114973279B (en) Training method and device for handwritten text image generation model and storage medium
CN115131709B (en) Video category prediction method, training method and device for video category prediction model
CN116524165B (en) Migration method, migration device, migration equipment and migration storage medium for three-dimensional expression model
CN113553863B (en) Text generation method, device, electronic equipment and storage medium
CN116416500B (en) Image recognition model training method, image recognition device and electronic equipment
CN114926447B (en) Method for training a model, method and device for detecting a target
CN113223500B (en) Speech recognition method, method for training speech recognition model and corresponding device
US20230153543A1 (en) Translation method, model training method, electronic devices and storage mediums
US20220383626A1 (en) Image processing method, model training method, relevant devices and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination