CN110610698B - Voice labeling method and device - Google Patents

Voice labeling method and device

Info

Publication number
CN110610698B
CN110610698B (application CN201910867063.XA)
Authority
CN
China
Prior art keywords
voice
information
recognition
sub
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910867063.XA
Other languages
Chinese (zh)
Other versions
CN110610698A (en)
Inventor
汪俊
闫博群
李索恒
张志齐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yitu Information Technology Co ltd
Original Assignee
Shanghai Yitu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yitu Information Technology Co ltd filed Critical Shanghai Yitu Information Technology Co ltd
Priority to CN201910867063.XA
Publication of CN110610698A
Application granted
Publication of CN110610698B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In an embodiment, the invention provides a voice labeling method and device, relating to the field of information technology. The method comprises the following steps: acquiring voice information to be labeled; inputting the voice information to be labeled into a voice recognition model to obtain a voice recognition result, wherein the voice recognition result at least comprises a candidate labeling result, the candidate labeling result is a plurality of recognition results determined by the voice recognition model for the same voice sub-information to be labeled, and the voice sub-information to be labeled is part or all of the voice information to be labeled; receiving a second labeling result determined by an annotator for the candidate labeling result; and determining the labeling result of the voice information to be labeled according to the second labeling result. The efficiency of the labeling process is improved through human-machine interaction.

Description

Voice labeling method and device
Technical Field
The embodiments of the invention belong to the field of information technology, and in particular relate to a voice labeling method and device.
Background
With the development of communication technology and the popularization of intelligent terminals, network communication tools have become one of the main tools for public communication. Because voice information is convenient to produce and transmit, it has become one of the main forms of information carried by these tools. Using them frequently involves converting voice information into text, a process handled by speech recognition technology.
Speech recognition usually requires training a speech recognition model, and training such a model requires labeling massive amounts of speech data. In the prior art, however, labeling is usually done manually, so the labeling of voice data is inefficient and error-prone.
Disclosure of Invention
The embodiments of the invention provide a voice labeling method and device, which improve both the efficiency and the accuracy of the voice labeling process.
In one aspect, an embodiment of the present invention provides a voice labeling method, where the method includes:
acquiring voice information to be labeled;
inputting the voice information to be labeled into a voice recognition model to obtain a voice recognition result, wherein the voice recognition result at least comprises a candidate labeling result, the candidate labeling result is a plurality of recognition results determined by the voice recognition model for the same voice sub-information to be labeled, and the voice sub-information to be labeled is part or all of the voice information to be labeled;
receiving a second labeling result determined by an annotator for the candidate labeling result;
and determining the labeling result of the voice information to be labeled according to the second labeling result.
Optionally, the receiving of the second labeling result determined by the annotator includes:
playing the voice sub-information to be labeled corresponding to the candidate labeling result;
and receiving the second labeling result determined by the annotator according to the played voice sub-information to be labeled.
Optionally, the inputting the to-be-labeled voice information into a voice recognition model to obtain a voice recognition result includes:
and inputting each piece of voice sub-information of the voice information to be labeled into each voice recognition sub-model in the voice recognition model, each voice recognition sub-model recognizing the voice sub-information, so as to obtain the recognition results on which the voice recognition sub-models agree and the recognition results on which they differ, wherein the differing recognition results of the sub-models are used as the candidate labeling results.
Optionally, the determining, according to the second labeling result, of the labeling result of the voice information to be labeled includes:
and determining the labeling result of the voice information to be labeled according to the recognition results on which the voice recognition sub-models agree and the second labeling result.
Optionally, after determining the labeling result of the voice information to be labeled according to the second labeling result, the method further includes:
and training the voice recognition model according to the labeling result of the voice information to be labeled.
In one aspect, an embodiment of the present invention further provides a voice annotation device, where the device includes:
the acquiring unit is used for acquiring the voice information to be labeled;
the recognition unit is used for inputting the voice information to be labeled into a voice recognition model to obtain a voice recognition result, wherein the voice recognition result at least comprises a candidate labeling result, the candidate labeling result is a plurality of recognition results determined by the voice recognition model for the same voice sub-information to be labeled, and the voice sub-information to be labeled is part or all of the voice information to be labeled;
the receiving unit is used for receiving a second labeling result determined by an annotator for the candidate labeling result;
and the determining unit is used for determining the labeling result of the voice information to be labeled according to the second labeling result.
Optionally, the receiving unit is specifically configured to:
playing the voice sub-information to be labeled corresponding to the candidate labeling result;
and receiving the second labeling result determined by the annotator according to the played voice sub-information to be labeled.
Optionally, the identification unit is specifically configured to:
and inputting each piece of voice sub-information of the voice information to be labeled into each voice recognition sub-model in the voice recognition model, each voice recognition sub-model recognizing the voice sub-information, so as to obtain the recognition results on which the voice recognition sub-models agree and the recognition results on which they differ, wherein the differing recognition results of the sub-models are used as the candidate labeling results.
Optionally, the determining unit is specifically configured to:
and determining the labeling result of the voice information to be labeled according to the recognition results on which the voice recognition sub-models agree and the second labeling result.
Optionally, the apparatus further comprises:
and the training unit is used for training the voice recognition model according to the labeling result of the voice information to be labeled.
In one aspect, an embodiment of the present invention provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the voice annotation method when executing the program.
In one aspect, an embodiment of the present invention provides a computer-readable storage medium, which stores a computer program executable by a computer device, and when the program runs on the computer device, the computer device is caused to execute the steps of the voice annotation method.
In the embodiments of the invention, the voice information to be labeled is first labeled preliminarily by the voice recognition model. During labeling, wherever the model determines a plurality of recognition results for the same voice sub-information to be labeled, that part is resolved in cooperation with manual labeling, while the remaining content needs no manual confirmation; labeling efficiency is thereby improved through human-machine interaction.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is an application scenario architecture diagram according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of a voice annotation method according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a recognition result of a speech recognition model according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a user inputting a recognition result according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a voice tagging scene according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a voice annotation apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and do not limit it.
To facilitate an understanding of the embodiments of the present invention, a few concepts are briefly introduced below:
artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the implementation method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like. With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
Speech recognition technology lets machines convert speech signals into corresponding text or commands through a process of recognition and understanding, turning language spoken by humans into a machine-usable form via speech signal processing and pattern recognition. It is a broad interdisciplinary field, closely related to acoustics, phonetics, linguistics, information theory, pattern recognition theory, neurobiology and other disciplines.
Voice labeling: an acoustic model is usually used in speech recognition, and building an acoustic model depends on a large amount of speech data together with the correct text information corresponding to it; training the model on such speech-text pairs yields the statistical relationship between speech and characters, producing the acoustic model. The process of determining the correct text information corresponding to speech data is called voice labeling.
In practice, the applicant of the present application found that current voice labeling methods often transcribe speech data into text by manual dictation to obtain labeling information. When a large number of speech-text pairs is required, however, this manual approach suffers from low efficiency, high labor cost and poor accuracy.
Based on these defects of the prior art, the applicant designed a voice labeling method in which labeling is performed by a speech recognition model, and when the model cannot recognize a part accurately, recognition is assisted through human-computer interaction, improving both the efficiency and the accuracy of voice labeling.
Having introduced the design concept of the embodiments of the present application, the application scenarios to which its technical solution can be applied are briefly described below. It should be noted that these scenarios are described only to illustrate the embodiments and are not limiting; in specific implementations, the technical scheme provided by the embodiments can be applied flexibly according to actual needs.
The training method of the speech processing model in the embodiments of the present application may be applied to the application scenario shown in fig. 1, which includes a terminal device 101 and a voice annotation server 102. The terminal device 101 is connected to the voice annotation server 102 through a wireless or wired network. The terminal device 101 includes, but is not limited to, intelligent devices such as smart speakers, smart watches, smart home appliances, intelligent robots, AI customer service systems and bank credit card telephone ordering systems, as well as electronic devices with a voice interaction function such as smartphones, mobile computers and tablet computers. The voice annotation server 102 may provide related voice services, such as voice recognition and voice synthesis, and may be a single server, a server cluster composed of a plurality of servers, or a cloud computing center.
It should be noted that the architecture diagram in the embodiments of the present application is intended to illustrate the technical solution more clearly and does not limit it; the technical solution provided by the embodiments is equally applicable to similar problems in other application-scenario architectures and business applications.
To further illustrate the technical solutions provided by the embodiments of the present application, they are described in detail below with reference to the accompanying drawings. Although the embodiments present the method operation steps shown in the following embodiments or figures, the method may include more or fewer steps based on conventional or non-inventive effort, and for steps with no necessary logical causal relationship, the order of execution is not limited to that given by the embodiments.
Based on the application scenario diagram shown in fig. 1, an embodiment of the present invention provides a voice annotation method, where a flow of the method may be executed by a voice annotation device, as shown in fig. 2, the method includes:
step S201, acquiring the voice information to be annotated.
Specifically, in the embodiments of the invention, the voice information to be labeled may be obtained from a voice database, or may be voice information obtained from various audio channels, for example from online video and audio, or from channels such as broadcast television.
Optionally, in the embodiments of the invention, if the audio length of the obtained voice information to be labeled exceeds a threshold, it may be segmented into a plurality of pieces of voice information to be labeled whose audio lengths are approximately the same.
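A minimal sketch of this optional segmentation step follows; the 30-second threshold and the raw-sample representation of the audio are assumptions, since the embodiment fixes neither.

```python
# A minimal sketch of the optional segmentation step: audio longer than a
# threshold is split into chunks of approximately equal length.
import math

import numpy as np

MAX_SECONDS = 30.0  # hypothetical threshold, not specified by the embodiment


def segment_audio(samples: np.ndarray, sample_rate: int) -> list[np.ndarray]:
    """Split audio into nearly equal chunks no longer than MAX_SECONDS each."""
    duration = len(samples) / sample_rate
    if duration <= MAX_SECONDS:
        return [samples]
    n_chunks = math.ceil(duration / MAX_SECONDS)
    # np.array_split produces chunks whose lengths differ by at most one
    # sample, matching "approximately the same audio length".
    return list(np.array_split(samples, n_chunks))
```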
Step S202, inputting the voice information to be labeled into a voice recognition model to obtain a voice recognition result, wherein the voice recognition result at least comprises a candidate labeling result, the candidate labeling result is a plurality of recognition results determined by the voice recognition model for the same voice sub-information to be labeled, and the voice sub-information to be labeled is part or all of the voice information to be labeled.
In the embodiments of the invention, the voice recognition model may be a deep learning model, such as a Convolutional Neural Network (CNN) model or a Recurrent Neural Network (RNN) model; it has speech recognition capability and can determine the text information corresponding to voice information relatively accurately.
In the embodiments of the invention, when the voice recognition model determines a plurality of recognition results for the same voice sub-information to be labeled, those recognition results are taken as candidate labeling results.
That is to say, when speech recognition of a certain part of the voice information to be labeled yields a plurality of results, those results are used as candidate labeling results.
In an alternative embodiment, the voice information to be labeled may be input into the voice recognition model several times, each pass producing one recognition result, and the voice recognition result is obtained from the multiple passes. If the recognition result of one pass differs from the others in some part, that part is taken as a candidate labeling result.
For example, in the embodiments of the invention, one pass over the voice information to be labeled returns "hello" written with the informal form of "you", and another pass returns "hello" written with the polite form; both forms of "you" are then taken as candidate labeling results.
In another alternative embodiment, the speech recognition model includes a plurality of sub-models, each of which recognizes the voice information to be labeled: sub-model A recognizes "I and you are good friends", sub-model B recognizes "I and you are fat friends", and sub-model C recognizes "I and you are bad friends". The words on which the sub-models disagree (in the original Chinese, the informal and polite forms of "you", "are" versus "are not", and "good", "fat" and "bad") are used as candidate labeling results.
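The sketch below shows one way the candidate labeling results could be extracted; it applies equally to transcripts from several passes of one model or from several sub-models. As in the example above, it assumes the transcripts align token for token; a real system would need a proper sequence alignment (for example, edit-distance based) when they do not.

```python
# A sketch of extracting candidate labeling results from several transcripts
# of the same audio. Assumes the transcripts have equal token counts.
from dataclasses import dataclass


@dataclass
class Span:
    consistent: bool    # True if all transcripts agree on this token
    options: list[str]  # a single entry if consistent, else the candidates


def diff_transcripts(transcripts: list[list[str]]) -> list[Span]:
    spans = []
    for tokens in zip(*transcripts):
        unique = list(dict.fromkeys(tokens))  # order-preserving dedup
        spans.append(Span(consistent=len(unique) == 1, options=unique))
    return spans


# The three sub-model outputs from the example above:
a = ["I", "and", "you", "are", "good", "friends"]
b = ["I", "and", "you", "are", "fat", "friends"]
c = ["I", "and", "you", "are", "bad", "friends"]
spans = diff_transcripts([a, b, c])
# spans[4] is inconsistent with options ["good", "fat", "bad"]: these become
# the candidate labeling results; all other spans are consistent.
```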
Step S203, receiving, for the candidate labeling result, a second labeling result determined by the annotator.
Specifically, in the embodiments of the invention, the parts that the speech recognition model cannot recognize accurately are determined by the annotator.
In an optional embodiment, the candidate labeling results are displayed, the voice sub-information to be labeled corresponding to them is played, and the second labeling result is the recognition result the annotator determines after listening to that voice sub-information.
For example, in the embodiments of the invention, if the candidate labeling results are the two forms of "you", "are"/"are not", and "good"/"fat"/"bad", then the voice sub-information corresponding to the two forms of "you" is played and the annotator determines that the second labeling result for it is the informal "you"; the voice sub-information corresponding to "are"/"are not" is played and the annotator determines that it is "are"; and the voice sub-information corresponding to "good"/"fat"/"bad" is played and the annotator determines that it is "good".
In the embodiments of the invention, as shown in fig. 3, the recognition result of the speech recognition model may be displayed; the displayed content includes the candidate labeling results and the parts of the recognition result the model determined unambiguously. When the annotator clicks a candidate labeling result, the corresponding voice is played, and the annotator is then prompted to select one of the candidates as the second labeling result.
Optionally, in the embodiments of the invention, as shown in fig. 4, the displayed content may further include an "other" option; when the annotator clicks it, the annotator's typed input is received and used as the second labeling result.
Optionally, in the embodiments of the invention, selectable attributes of the voice information to be labeled may also be displayed and the annotator's selections received; for example, the selectable attributes may include the gender of the speaker, the speaker's accent, whether there is noise in the recording environment, and whether the sound comes from a single speaker.
Optionally, in the embodiments of the invention, since the voice sub-information corresponding to a candidate labeling result sometimes cannot be labeled at all, an "unable to label" option may also be added to the displayed content, in which case the voice information is treated as unlabelable.
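Taken together, the playback, candidate selection, "other" option and "unable to label" option form a small interaction loop. A console sketch of that loop follows, assuming a command-line interface; the playback call is a stub, since the embodiment leaves playback to the labeling page.

```python
# A console sketch of the human-in-the-loop step: for each candidate labeling
# result the corresponding audio is played and the annotator picks a candidate,
# types a correction ("other"), or marks the span as unlabelable.
CANNOT_LABEL = object()  # sentinel for the "unable to label" option


def play_audio(span_audio) -> None:
    """Stub: the real labeling page plays the to-be-labeled voice sub-information."""


def resolve_candidates(span_audio, candidates: list[str]):
    play_audio(span_audio)
    for i, c in enumerate(candidates, 1):
        print(f"{i}. {c}")
    print(f"{len(candidates) + 1}. other (type your own)")
    print(f"{len(candidates) + 2}. unable to label")
    choice = int(input("choice: "))
    if choice <= len(candidates):
        return candidates[choice - 1]    # a second labeling result
    if choice == len(candidates) + 1:
        return input("transcription: ")  # annotator-typed "other" text
    return CANNOT_LABEL
```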
Step S204, determining the labeling result of the voice information to be labeled according to the second labeling result.
After the annotator's second labeling result is received, the parts the speech recognition model recognized accurately, combined with the second labeling result, are determined as the labeling result of the voice information to be labeled.
For example, the annotator determines that the second labeling results are the informal "you", "are" and "good", so the labeling result of the voice information to be labeled is "I and you are good friends".
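A short sketch of this assembly step, continuing the Span and resolve_candidates sketches above: spans the model recognized consistently are kept, each candidate span is replaced by the annotator's second labeling result, and an "unable to label" choice marks the whole item as unlabelable (the last point is an assumption; the embodiment only says such information cannot be labeled).

```python
# A sketch of step S204: merge the consistent spans with the annotator's
# selections. `spans` comes from diff_transcripts; `choices` maps the index of
# each inconsistent span to the value returned by resolve_candidates.
def assemble_label(spans, choices):
    tokens = []
    for i, span in enumerate(spans):
        if span.consistent:
            tokens.append(span.options[0])  # part the model recognized accurately
        else:
            choice = choices[i]
            if choice is CANNOT_LABEL:      # treat the whole item as unlabelable
                return None
            tokens.append(choice)           # annotator's second labeling result
    return " ".join(tokens)
```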
In the embodiments of the invention, after the labeling result of one piece of voice information is determined, the next piece of voice information to be labeled can be input into the voice recognition model and labeled in the same way.
After all the voice information that needs labeling has been labeled, the voice recognition model can be trained on the labeled voice information, improving its recognition capability, and the trained model is then used to continue labeling voice information.
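The overall label-then-retrain cycle can then be pictured as below, reusing the earlier sketches; the recognize and fine_tune interfaces are placeholders, since the embodiment does not specify a training API.

```python
# A high-level sketch of the cycle: label a batch with model assistance, skip
# unlabelable items, retrain, then continue labeling with the improved model.
# `submodel.recognize(audio)` is assumed to return a token list, and
# `model.fine_tune(pairs)` to train on (audio, text) pairs.
def annotation_cycle(model, submodels, unlabeled_audio):
    labeled = []
    for audio in unlabeled_audio:
        spans = diff_transcripts([m.recognize(audio) for m in submodels])
        choices = {i: resolve_candidates(audio, s.options)
                   for i, s in enumerate(spans) if not s.consistent}
        text = assemble_label(spans, choices)
        if text is not None:          # drop items marked "unable to label"
            labeled.append((audio, text))
    model.fine_tune(labeled)          # the improved model labels the next batch
```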
To better explain the embodiments of the present application, the voice labeling method they provide is described below in a specific implementation scenario. As shown in fig. 5, in the embodiments of the invention, voice information to be labeled is labeled through a labeling page connected to a labeling server, and labeling is completed through the labeling server, which comprises at least a speech recognition model.
The labeling page shown in fig. 5 includes a playback portion for the voice information to be labeled, a recognition result portion for the speech recognition model, and a manual labeling portion. The playback portion can play the parts the speech recognition model could not recognize accurately; the recognition result portion shows both the results the model recognized accurately and the inaccurate ones, and the inaccurate results can be played through the playback portion.
The manual labeling portion receives the annotator's selections: the annotator determines the accurate result for the parts the speech recognition model could not recognize accurately, and can further set the attributes of the voice information to be labeled.
In the embodiments of the invention, the voice information to be labeled is first input, and the speech recognition result obtained by the speech recognition model is displayed in the recognition result portion. The displayed content includes several inaccurately recognized parts: the first includes four results ("acquisition", "acquisition case", "acquisition press" and "acquisition o"), the second includes three ("spoofing", "sucking" and "adsorbing"), the third includes three ("mobile", "human" and "ground"), and the fourth includes two ("bar" and "eight").
When the annotator clicks an inaccurately recognized part, the corresponding audio is played, and the annotator can choose the playback position by dragging the progress bar.
After the annotator selects a recognition result, it is shown in the recognition result display area, along with the other inaccurate recognition results. Specifically, combining the annotator's selections with the speech recognition model's output, the recognition result of the voice information to be labeled is "purchase this will not be the company bar that wants to deceive".
Meanwhile, the annotator also selects the attributes of the voice to be labeled: the speaker is female with a northeastern accent, there is no noise or other abnormality in the recording environment, and the speaker speaks alone.
Labeling of the voice information to be labeled is thus completed jointly by the speech recognition model and the annotator.
Based on the foregoing embodiment, referring to fig. 6, an embodiment of the present invention provides a speech tagging apparatus 600, including:
an obtaining unit 601, configured to obtain voice information to be annotated;
the recognition unit 602 is configured to input the to-be-labeled voice information into a voice recognition model to obtain a voice recognition result, where the voice recognition result at least includes a candidate labeling result, the candidate recognition result is multiple recognition results determined by the voice recognition model for the same to-be-labeled voice sub-information, and the to-be-labeled voice sub-information is part or all of the to-be-labeled voice information;
a receiving unit 603, configured to receive, for the candidate annotation result, a second annotation result determined by an annotation worker;
a determining unit 604, configured to receive, for the candidate annotation result, a second annotation result determined by the annotator.
Optionally, the receiving unit 603 is specifically configured to:
playing the voice sub-information to be labeled corresponding to the candidate labeling result;
and receiving the second labeling result determined by the annotator according to the played voice sub-information to be labeled.
Optionally, the recognition unit 602 is specifically configured to:
input each piece of voice sub-information of the voice information to be labeled into each voice recognition sub-model in the voice recognition model, each voice recognition sub-model recognizing the voice sub-information, so as to obtain the recognition results on which the voice recognition sub-models agree and the recognition results on which they differ, wherein the differing recognition results of the sub-models are used as the candidate labeling results.
Optionally, the determining unit 604 is specifically configured to:
and determining the labeling result of the voice information to be labeled according to the recognition results on which the voice recognition sub-models agree and the second labeling result.
Optionally, the apparatus further comprises:
a training unit 605, configured to train the speech recognition model according to the labeling result of the speech information to be labeled.
Based on the same technical concept, an embodiment of the present application provides a computer device, as shown in fig. 7, including at least one processor 701 and a memory 702 connected to it. This embodiment does not limit the specific connection medium between the processor 701 and the memory 702; in fig. 7 they are connected through a bus, as an example. The bus may be divided into an address bus, a data bus, a control bus, and so on.
In the embodiment of the present application, the memory 702 stores instructions executable by the at least one processor 701, and the at least one processor 701 may execute the steps included in the aforementioned voice annotation method by executing the instructions stored in the memory 702.
The processor 701 is the control center of the computer device; it can connect the various parts of the terminal device using various interfaces and lines, and performs the functions of the computer device by running or executing the instructions stored in the memory 702 and calling the data stored there. Optionally, the processor 701 may include one or more processing units and may integrate an application processor, which mainly handles the operating system, user interface and application programs, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor need not be integrated into the processor 701. In some embodiments, the processor 701 and the memory 702 may be implemented on the same chip; in others, they may be implemented on separate chips.
The processor 701 may be a general-purpose processor, such as a Central Processing Unit (CPU), a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof, configured to implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present Application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.
The memory 702, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs and non-volatile computer-executable programs and modules. It may include at least one type of storage medium, for example flash memory, a hard disk, a multimedia card, card-type memory, Random Access Memory (RAM), Static Random Access Memory (SRAM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), magnetic memory, a magnetic disk, or an optical disc. The memory 702 may also be, without limitation, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 702 in the embodiments of the present application may further be circuitry or any other device capable of performing a storage function, used to store program instructions and/or data.
Based on the same technical concept, embodiments of the present application provide a computer-readable storage medium storing a computer program executable by a computer device, which, when the program runs on the computer device, causes the computer device to execute the steps of the voice annotation method.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (8)

1. A voice labeling method, the method comprising:
acquiring voice information to be labeled;
inputting each piece of voice sub-information of the voice information to be labeled into each voice recognition sub-model in a voice recognition model, each voice recognition sub-model recognizing the voice sub-information, so as to obtain the recognition results on which the voice recognition sub-models agree and the recognition results on which they differ, wherein the differing recognition results of the sub-models are used as candidate labeling results, the candidate labeling results are a plurality of recognition results determined by the voice recognition model for the same voice sub-information to be labeled, and the voice sub-information to be labeled is part or all of the voice information to be labeled;
receiving a second labeling result determined by an annotator for the candidate labeling result;
and determining the labeling result of the voice information to be labeled according to the recognition results on which the voice recognition sub-models agree and the second labeling result.
2. The method of claim 1, wherein the receiving of the second labeling result determined by the annotator comprises:
playing the voice sub-information to be labeled corresponding to the candidate labeling result;
and receiving the second labeling result determined by the annotator according to the played voice sub-information to be labeled.
3. The method of claim 1, wherein after determining the labeling result of the voice information to be labeled according to the recognition results on which the voice recognition sub-models agree and the second labeling result, the method further comprises:
and training the voice recognition model according to the labeling result of the voice information to be labeled.
4. A voice labeling apparatus, the apparatus comprising:
an acquiring unit, used for acquiring voice information to be labeled;
a recognition unit, used for inputting each piece of voice sub-information of the voice information to be labeled into each voice recognition sub-model in a voice recognition model, each voice recognition sub-model recognizing the voice sub-information, so as to obtain the recognition results on which the voice recognition sub-models agree and the recognition results on which they differ, wherein the differing recognition results of the sub-models are used as candidate labeling results, the candidate labeling results are a plurality of recognition results determined by the voice recognition model for the same voice sub-information to be labeled, and the voice sub-information to be labeled is part or all of the voice information to be labeled;
a receiving unit, used for receiving a second labeling result determined by an annotator for the candidate labeling result;
and a determining unit, used for determining the labeling result of the voice information to be labeled according to the recognition results on which the voice recognition sub-models agree and the second labeling result.
5. The apparatus according to claim 4, wherein the receiving unit is specifically configured to:
playing the voice sub-information to be labeled corresponding to the candidate labeling result;
and receiving the second labeling result determined by the annotator according to the played voice sub-information to be labeled.
6. The apparatus of claim 4, further comprising:
and the training unit is used for training the voice recognition model according to the labeling result of the voice information to be labeled.
7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any of claims 1 to 3 are performed when the program is executed by the processor.
8. A computer-readable storage medium, storing a computer program executable by a computer device, the program, when executed on the computer device, causing the computer device to perform the steps of the method of any one of claims 1 to 3.
CN201910867063.XA 2019-09-12 2019-09-12 Voice labeling method and device Active CN110610698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910867063.XA CN110610698B (en) 2019-09-12 2019-09-12 Voice labeling method and device

Publications (2)

Publication Number Publication Date
CN110610698A CN110610698A (en) 2019-12-24
CN110610698B (en) 2022-09-27

Family

ID=68891200

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910867063.XA Active CN110610698B (en) 2019-09-12 2019-09-12 Voice labeling method and device

Country Status (1)

Country Link
CN (1) CN110610698B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113314105A (en) * 2020-02-07 2021-08-27 菜鸟智能物流控股有限公司 Voice data processing method, device, equipment and storage medium
CN112037782A (en) * 2020-06-30 2020-12-04 北京来也网络科技有限公司 RPA and AI combined early media identification method, device, equipment and storage medium
CN112712795B (en) * 2020-12-29 2024-04-02 北京有竹居网络技术有限公司 Labeling data determining method, labeling data determining device, labeling data determining medium and electronic equipment
CN113421591B (en) * 2021-06-30 2024-06-25 平安科技(深圳)有限公司 Voice marking method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107578769A (en) * 2016-07-04 2018-01-12 科大讯飞股份有限公司 Speech data mask method and device
CN108389577A (en) * 2018-02-12 2018-08-10 广州视源电子科技股份有限公司 Optimize method, system, equipment and the storage medium of voice recognition acoustic model
CN109065031A (en) * 2018-08-02 2018-12-21 阿里巴巴集团控股有限公司 Voice annotation method, device and equipment
CN109599095A (en) * 2018-11-21 2019-04-09 百度在线网络技术(北京)有限公司 A kind of mask method of voice data, device, equipment and computer storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101581816B1 (en) * 2014-10-14 2016-01-05 서강대학교산학협력단 Voice recognition method using machine learning



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant