CN113611311A - Voice transcription method, device, recording equipment and storage medium - Google Patents

Voice transcription method, device, recording equipment and storage medium

Info

Publication number
CN113611311A
CN113611311A
Authority
CN
China
Prior art keywords
transcription
audio
transcribed
voice transcription
voice
Prior art date
Legal status
Pending
Application number
CN202110963345.7A
Other languages
Chinese (zh)
Inventor
王志军
任晓宁
崔浩然
孔常青
Current Assignee
Tianjin Xunfeiji Technology Co ltd
Original Assignee
Tianjin Xunfeiji Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Tianjin Xunfeiji Technology Co ltd filed Critical Tianjin Xunfeiji Technology Co ltd
Priority to CN202110963345.7A priority Critical patent/CN113611311A/en
Publication of CN113611311A publication Critical patent/CN113611311A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/28 - Constructional details of speech recognition systems
    • G10L 15/285 - Memory allocation or algorithm optimisation to reduce hardware requirements
    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11C - STATIC STORES
    • G11C 7/00 - Arrangements for writing information into, or reading information out from, a digital store
    • G11C 7/16 - Storage of analogue signals in digital stores using an arrangement comprising analogue/digital [A/D] converters, digital memories and digital/analogue [D/A] converters

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides a voice transcription method, a voice transcription apparatus, a recording device, and a storage medium. The method comprises: determining audio to be transcribed; and performing offline voice transcription on the audio to be transcribed based on a locally stored voice transcription engine, where the computational resources of the voice transcription engine are determined based on the data specification of the audio to be transcribed. The locally stored voice transcription engine enables offline voice transcription that is more responsive, more secure, and no longer dependent on a network. During offline transcription, the engine's transcription efficiency under different loads and the smart device's power consumption are both taken into account, so that when the engine invokes its allocated computational resources to transcribe the audio, the trade-off between device power consumption and transcription efficiency can be balanced: power consumption is reduced while transcription efficiency is maintained, alleviating severe heating and poor battery life.

Description

Voice transcription method, device, recording equipment and storage medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech transcription method, apparatus, recording device, and storage medium.
Background
The recording pen is widely applied to occasions such as conferences, lecture recording, interviews, classrooms and the like by virtue of the advantages of convenience in carrying, simplicity in operation and the like.
However, mainstream recording pens on the market are traditional ones whose functionality is limited to recording: to obtain a transcript, the recorded audio file must be manually converted into text after recording finishes, or the audio data must be uploaded to the cloud for transcription.
This approach is cumbersome and inflexible, and uploading the audio file to the cloud for transcription makes transcription timeliness heavily dependent on network conditions and risks leaking the user's privacy.
Disclosure of Invention
The invention provides a voice transcription method, a voice transcription apparatus, an electronic device, a recording device, and a storage medium, to overcome the defects of the prior art, in which online voice transcription depends on network conditions and suffers from poor stability and poor data security.
The invention provides a voice transcription method, which comprises the following steps:
determining audio to be transcribed;
and performing offline voice transcription on the audio to be transcribed based on a locally stored voice transcription engine, wherein the operation resource of the voice transcription engine is determined based on the data specification of the audio to be transcribed.
According to the voice transcription method provided by the invention, the operation resource of the voice transcription engine is determined based on the following steps:
determining the resource occupation ratio of the voice transcription engine based on the data specification of the audio to be transcribed together with the device energy consumption state and/or the device heating state, or based on the data specification of the audio to be transcribed alone;
determining the computational resources of the speech transcription engine from the device computing resources based on the resource occupation ratio.
According to the voice transcription method provided by the invention, the data specification is determined based on the data volume of the audio to be transcribed and the storage volume of the storage space, and the storage space is used for storing the audio to be transcribed.
According to the voice transcription method provided by the invention, performing offline voice transcription on the audio to be transcribed based on the locally stored voice transcription engine comprises the following steps:
coding the acoustic characteristics of the audio to be transcribed based on an acoustic model in the voice transcription engine, and determining a first transcription result based on the acoustic characteristics;
decoding the acoustic features based on a decoding model in the voice transcription engine to obtain a second transcription result;
and fusing the first transcription result and the second transcription result based on the voice transcription engine to obtain the transcription text of the audio to be transcribed.
According to the voice transcription method provided by the invention, the acoustic model is obtained by distilling and training an original acoustic model based on sample audio.
According to the voice transcription method provided by the invention, the acoustic model is obtained by training based on the following steps:
determining a first acoustic feature probability distribution of the sample audio based on the original acoustic model;
determining a second acoustic feature probability distribution of the sample audio based on an acoustic model of a training phase;
determining a distillation loss value based on the first acoustic feature probability distribution and the second acoustic feature probability distribution;
and adjusting parameters of the acoustic model in the training stage based on the distillation loss value to obtain the acoustic model.
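The distillation steps above can be sketched as follows. This is a minimal illustrative sketch, not the patent's actual implementation: the KL-divergence loss, the temperature parameter, and all function names are assumptions introduced for illustration, and the "probability distributions" are toy per-frame softmax outputs.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution (optionally temperature-scaled)."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the original (teacher) model's and the
    training-phase (student) model's per-frame acoustic-feature
    probability distributions -- one plausible form of the distillation
    loss described in the steps above (a hypothetical choice)."""
    p_teacher = softmax(teacher_logits, temperature)  # first distribution
    p_student = softmax(student_logits, temperature)  # second distribution
    eps = 1e-12
    kl = np.sum(p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)), axis=-1)
    return float(np.mean(kl))

# Toy example: 3 audio frames, 5 acoustic units.
teacher = np.array([[2.0, 0.1, 0.1, 0.1, 0.1],
                    [0.1, 2.0, 0.1, 0.1, 0.1],
                    [0.1, 0.1, 2.0, 0.1, 0.1]])
student_close = teacher * 0.9                # student nearly matches teacher
student_far = np.zeros_like(teacher)         # student outputs a uniform distribution

loss_close = distillation_loss(teacher, student_close)
loss_far = distillation_loss(teacher, student_far)
```

In training, the parameters of the student acoustic model would be adjusted to reduce this loss, pulling its distribution toward the teacher's; a student close to the teacher yields a smaller loss than one that has learned nothing.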
According to a voice transcription method provided by the present invention, the fusing the first transcription result and the second transcription result based on the voice transcription engine to obtain the transcription text of the audio to be transcribed comprises:
fusing the first transcription result and the second transcription result based on a text generation model in the voice transcription engine to obtain a transcription text of the audio to be transcribed;
the text generation model is trained by combining the acoustic model and the decoding model based on sample audio and sample transcription text of the sample audio.
The invention also provides a voice transcription device, comprising:
an audio determining unit for determining an audio to be transcribed;
and the voice transcription unit is used for performing off-line voice transcription on the audio to be transcribed based on a locally stored voice transcription engine, and the operation resource of the voice transcription engine is determined based on the data specification of the audio to be transcribed.
The invention also provides a recording device comprising a sound pickup apparatus, a memory, a processor, and a computer program stored on the memory and executable on the processor. The sound pickup apparatus records the audio to be transcribed, and the processor, when executing the program, implements the steps of any of the voice transcription methods described above.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the speech transcription method as described in any of the above.
The voice transcription method, apparatus, recording device, and storage medium provided by the invention achieve offline voice transcription that is more responsive, more secure, and no longer dependent on a network, through a locally stored voice transcription engine. During offline transcription, the engine's computational resources are determined from the data specification of the audio to be transcribed, taking into account both the engine's transcription efficiency under different loads and the smart device's power consumption. When the engine invokes its allocated resources to transcribe the audio, the trade-off between device power consumption and transcription efficiency can therefore be balanced: power consumption is reduced while transcription efficiency is maintained, alleviating severe heating and poor battery life.
Drawings
To illustrate the technical solutions of the present invention or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a voice transcription method provided by the present invention;
FIG. 2 is a schematic diagram of a process for determining computational resources of a speech transcription engine provided by the present invention;
FIG. 3 is a flow chart illustrating step 120 of the voice transcription method provided by the present invention;
FIG. 4 is a schematic diagram of the determination process of an acoustic model provided by the present invention;
FIG. 5 is a flow chart of the training of an acoustic model provided by the present invention;
FIG. 6 is a schematic flow chart of transcription result fusion provided by the present invention;
FIG. 7 is an overall framework diagram of the voice transcription method provided by the present invention;
FIG. 8 is a schematic structural diagram of a voice transcription apparatus provided in the present invention;
FIG. 9 is a schematic structural diagram of an audio recording apparatus according to the present invention;
fig. 10 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, mainstream recording pens on the market are traditional recording pens whose functionality is limited to recording. To convert speech into text, the recorded audio file must be manually transcribed after recording finishes. Such single-function recording pens clearly restrict users' flexibility: they are inefficient to use and impose extra repetitive work, consuming a great deal of the user's time and energy.
In addition, recording pens with a voice transcription function do exist on the market. Their transcription works by exporting the recorded audio file, uploading it to the cloud, and letting the cloud perform the transcription. Compared with manual transcription, this saves the user some time and effort, but the process remains cumbersome and relies on cloud-side transcription: transmitting the audio file inevitably requires a network, and the transmission rate is closely tied to network conditions. Under a weak network or no network, file transfer slows severely, and transmission delays prevent the transcription result from arriving in time, degrading the user experience. Moreover, uploading audio files to the cloud may risk leaking user information, so data security cannot be fundamentally guaranteed.
In view of the above, the present invention provides a voice transcription method aimed at realizing offline voice transcription. Fig. 1 is a schematic flow chart of the voice transcription method provided by the present invention. As shown in Fig. 1, the method is applied to a voice transcription processor, which can be deployed inside any smart device with a voice transcription function, such as a recording pen, a smartphone, or a smart wristband. The method includes:
step 110, determining audio to be transcribed;
specifically, before performing voice transcription, it is necessary to first determine audio to be subjected to voice transcription, that is, audio to be transcribed. The audio to be transcribed can be obtained through a sound pickup device, and the sound pickup device can be a microphone or a microphone array, and also can be an intelligent device such as a recording pen, a smart phone and a smart band provided with the microphone or the microphone array. After the pickup equipment obtains the audio frequency to be transcribed, the pickup equipment can also directly store the audio frequency to be transcribed in a storage unit in the intelligent equipment so as to be called by a voice transcription processor in the intelligent equipment.
After detecting the audio to be transcribed, the voice transcription processor may additionally amplify the audio and suppress its noise; this is not specifically limited in the embodiments of the present invention.
It should be noted that the audio to be transcribed here may be a complete audio file generated after sound pickup finishes, or an audio data stream produced continuously during pickup; this is not specifically limited in the embodiments of the present invention.
And step 120, performing offline voice transcription on the audio to be transcribed based on the locally stored voice transcription engine, wherein the operation resource of the voice transcription engine is determined based on the data specification of the audio to be transcribed.
Specifically, current voice transcription relies on the cloud: transcription timeliness is heavily affected by network conditions, and cloud transcription requires transmitting the audio to be transcribed over the network, exposing the user's privacy. Therefore, in the embodiments of the present invention, the voice transcription engine is stored locally, and offline transcription of the audio to be transcribed is achieved through this engine alone, with no network transmission and the entire transcription process taking place locally on the smart device. Compared with online transcription, this offline mode is more secure and private, more responsive, and needs no network support. It avoids the risk of user information leaking during transcription, fundamentally solves the problem that data security cannot be guaranteed, eliminates the inefficiency of transcription under weak or absent networks, and improves the user experience.
However, the smart device performing local offline voice transcription is usually a handheld mobile device, such as a recording pen, with poor heat dissipation. If excessive computational resources are allocated to the voice transcription engine, the device heats continuously, which not only drains its battery but also degrades transcription efficiency. Conversely, if the resources allocated to the engine are too limited, the audio to be transcribed accumulates on the device, transcription delay grows ever larger, the real-time performance and efficiency of transcription suffer greatly, and the user experience is poor.
In view of this problem, when running the locally stored voice transcription engine, the embodiments of the present invention fully weigh transcription efficiency against transcription power consumption. Specifically, the engine's computational resources are adjusted dynamically according to the data specification of the audio to be transcribed, so as to manage the engine's power consumption on the smart device.
Here, the data specification of the audio to be transcribed reflects its data size, and hence the amount of computational resources currently required. When allocating computational resources to the voice transcription engine, the data specification of the audio data stream can be consulted: when the specification is small, fewer resources are allocated to the engine, avoiding waste and reducing the smart device's power consumption; when it is large, more resources are allocated to guarantee the timeliness of the transcription task.
The voice transcription method provided by the invention achieves offline voice transcription that is more responsive, more secure, and no longer dependent on a network, through a locally stored voice transcription engine. During offline transcription, the engine's computational resources are determined from the data specification of the audio to be transcribed, taking into account both the engine's transcription efficiency under different loads and the smart device's power consumption, so that when the engine invokes its allocated resources to transcribe the audio, the trade-off between power consumption and transcription efficiency can be balanced: power consumption is reduced while transcription efficiency is maintained, alleviating severe heating and poor battery life.
Based on the foregoing embodiment, Fig. 2 is a schematic diagram of the process of determining the computational resources of the speech transcription engine provided by the present invention. As shown in Fig. 2, the computational resources of the speech transcription engine are determined by the following steps:
step 210, determining the resource occupation ratio of the voice transcription engine based on the data specification of the audio to be transcribed, and the energy consumption state and/or the heating state of the equipment, or based on the data specification of the audio to be transcribed;
step 220, determining the operation resources of the voice transcription engine from the equipment computing resources based on the resource proportion.
Specifically, before the audio to be transcribed is transcribed by the voice transcription engine, the engine's computational resources must be determined, so that the engine can invoke the allocated resources to perform the transcription.
The engine's computational resources are allocated from the device computing resources according to the engine's resource occupation ratio. The resource occupation ratio here is the proportion of the device computing resources required by the speech transcription engine, and the device computing resources are all the resources the smart device has available for computation.
In step 210, the resource occupation ratio of the computational resources required by the speech transcription engine can be determined from the data specification of the audio to be transcribed. Beyond that, considering that insufficient resources or insufficient battery may interrupt transcription on the smart device, and in order to avoid such interruptions and extend the device's battery life, the resource occupation ratio can instead be determined from the data specification of the audio to be transcribed together with either or both of the device energy consumption state and the device heating state.
The device energy consumption state reflects the device's current working condition, such as its remaining battery and estimated sustainable working time. It can be consulted when allocating resources to the engine: with ample battery, more resources can be allocated to improve offline transcription efficiency; with low battery, keeping the device running takes priority, and fewer resources are allocated to save energy.
The device heating state reflects the device's current temperature condition, such as its present temperature and whether it is overheating. It can likewise be consulted when allocating resources: if the device is clearly overheated, continuing to allocate more resources to the engine would drive the temperature still higher and could even impair normal operation, so the allocation can be reduced to relieve the overheating.
In addition, if the resource occupation ratio of the speech transcription engine is determined from the data specification of the audio to be transcribed together with either or both of the device energy consumption state and the device heating state, corresponding weights can be assigned to the three factors according to their importance to voice transcription, and the resource occupation ratio computed from their weighted combination.
For example, because the data specification of the audio to be transcribed indicates how much audio has accumulated, which strongly affects the real-time performance of transcription, it takes priority over the energy consumption and heating states, i.e., it is the most important, and can be given the highest weight. And because the device's battery also affects transcription, in particular making transcription impossible when battery is insufficient, the energy consumption state outranks the heating state and can be given the next-highest weight.
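The weighted combination described above can be sketched as follows. The specific weight values and the normalization of each factor to [0, 1] are assumptions introduced for illustration; the patent does not fix concrete numbers.

```python
def resource_ratio(data_spec, energy_state, heat_state,
                   w_spec=0.6, w_energy=0.25, w_heat=0.15):
    """Hypothetical weighted combination of the three factors.

    Each factor is assumed normalized to [0, 1]:
      data_spec    -- fullness of the to-be-transcribed buffer (1.0 = full)
      energy_state -- remaining battery (1.0 = full charge)
      heat_state   -- thermal headroom (1.0 = cool, 0.0 = overheated)

    The default weights reflect the priority order described above:
    data specification > energy consumption state > heating state.
    """
    ratio = w_spec * data_spec + w_energy * energy_state + w_heat * heat_state
    return max(0.0, min(1.0, ratio))  # clamp to a valid proportion

# A full buffer on a cool, fully charged device claims a large share of the
# device computing resources; a near-empty buffer on a hot, low-battery
# device claims very little.
busy = resource_ratio(data_spec=1.0, energy_state=1.0, heat_state=1.0)
idle = resource_ratio(data_spec=0.1, energy_state=0.2, heat_state=0.1)
```

The engine's computational resources would then be the product of this ratio and the device computing resources, as in step 220.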
With the resource occupation ratio from step 210 in hand, step 220 determines the engine's computational resources from the device computing resources according to that ratio; specifically, the engine's computational resources are the product of the resource occupation ratio and the device computing resources.
The method of this embodiment determines the engine's resource occupation ratio either from the data specification of the audio to be transcribed alone, or from the data specification combined with the device energy consumption state and/or device heating state. By integrating multiple factors into the ratio, the engine, when invoking the computational resources so determined, can save device energy and extend battery life to the greatest extent while still guaranteeing transcription efficiency, and can also meet transcription needs in different scenarios, improving the flexibility and applicability of voice transcription.
Based on the above-described embodiment, the data specification is determined based on the data amount of the audio to be transcribed and the storage amount of the storage space for storing the audio to be transcribed.
Specifically, after the audio to be transcribed is determined in step 110, its data amount and the storage amount of the storage space must be determined, so that the data specification of the audio to be transcribed can be derived from the two, and the engine's computational resources subsequently determined from that specification. Here, the data amount of the audio to be transcribed is the amount of audio awaiting transcription held in the storage space, which may be measured in frames; the storage amount of the storage space is the amount of audio data the space can hold.
Based on the above embodiment, the computational resources of the speech transcription engine can be calculated by the following formula:
n = a × (m / M) × N
where n denotes the computational resources of the voice transcription engine, N the device computing resources, m the amount of audio to be transcribed held in the storage space, M the storage amount of the storage space, and a an adjustment coefficient.
When the amount of audio to be transcribed held in the storage space is at its maximum, the engine's computational resources are at their maximum, N; when the storage space is empty, the engine's computational resources are at their minimum. Allocating computational resources to the engine dynamically according to the data specification of the audio to be transcribed in this way effectively reduces the smart device's overall power consumption and alleviates severe heating and poor battery life.
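The allocation rule can be written as a small function. Note that in the source the equation itself appears only as an image, so the linear form n = a · (m / M) · N used below is a reconstruction from the variable definitions and boundary behavior stated in the text, and the default a = 1 is an assumption.

```python
def engine_resources(m, M, N, a=1.0):
    """Computational resources n allocated to the speech transcription engine.

    m -- amount of audio to be transcribed currently held in the storage space
    M -- total storage amount (capacity) of the storage space
    N -- total device computing resources
    a -- adjustment coefficient (hypothetical default of 1.0)

    The allocation grows with buffer fullness: it reaches N when the
    storage space is full (with a = 1) and drops to zero when it is empty.
    """
    return a * (m / M) * N

full = engine_resources(m=100, M=100, N=8)   # storage full  -> all 8 units
empty = engine_resources(m=0, M=100, N=8)    # storage empty -> minimum
half = engine_resources(m=50, M=100, N=8)    # half full     -> half of N
```

A fractional result would in practice be rounded to a whole number of cores or threads; that policy is not specified in the text.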
Based on the above embodiment, fig. 3 is a schematic flowchart of step 120 in the voice transcription method provided by the present invention, and as shown in fig. 3, step 120 includes:
step 121, based on the acoustic model in the speech transcription engine, encoding the acoustic features of the audio to be transcribed, and determining a first transcription result based on the acoustic features;
step 122, decoding the acoustic features based on a decoding model in the speech transcription engine to obtain a second transcription result;
and step 123, fusing the first transcription result and the second transcription result based on the voice transcription engine to obtain the transcription text of the audio to be transcribed.
Considering that an intelligent device may face various complex scenes in actual use, such as high background noise, multi-person conferences, overlapping speech, and mixed-language utterances, the quality of the recorded audio in such scenes is low. If voice transcription is performed with a conventional voice transcription method alone, the transcription result will contain a large number of errors, seriously affecting the accuracy of the voice transcription result.
Specifically, after the computing resources of the speech transcription engine are determined, voice transcription can be performed on the audio to be transcribed. In step 121, the audio to be transcribed is input into the acoustic model of the speech transcription engine; the acoustic model encodes the input audio to obtain its acoustic features, and a first transcription result of the audio to be transcribed is determined from those acoustic features. Since the first transcription result is determined directly from the acoustic features output by the acoustic model, it is the result of conventional voice transcription at the acoustic level.
After the acoustic features are obtained through the acoustic model in step 121, in step 122 they may be decoded by a decoding model in the speech transcription engine: the acoustic features of the audio to be transcribed are input into the decoding model, which decodes them to obtain and output a second transcription result of the audio to be transcribed. The second transcription result is determined jointly by the acoustic model and the decoding model; their combination can be regarded as an end-to-end encoder-decoder model, in which the acoustic model serves as the Encoder and the decoding model serves as the Decoder. That is, the second transcription result is the result of voice transcription based on an end-to-end encoder-decoder model.
After the first transcription result and the second transcription result are obtained in steps 121 and 122, in step 123 these two results, produced by different voice transcription approaches, can be fused. The resulting transcription text combines the conventional acoustic-level voice transcription approach with the end-to-end approach, alleviating the large number of transcription errors that may occur in complex scenes.
The voice transcription method provided by the embodiment of the invention combines the acoustic model and the decoding model, performing voice transcription with both the conventional acoustic-level approach and the end-to-end approach. The resulting transcription text of the audio to be transcribed overcomes the low accuracy of the transcription results obtained in conventional schemes and alleviates the large number of transcription errors that may occur in complex scenes.
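As a toy illustration of fusing two transcription results, the following token-wise score comparison can be used. This is not the patent's learned fusion (which is a trained text generation model, described later); it assumes, purely for illustration, that both hypotheses are already aligned token by token and carry per-token scores:

```python
def fuse_hypotheses(first, second):
    """first/second: equal-length lists of (token, score) pairs, one
    from the acoustic-level pass and one from the end-to-end pass.
    For each position, keep the token whose pass scored it higher."""
    assert len(first) == len(second)
    return [a[0] if a[1] >= b[1] else b[0] for a, b in zip(first, second)]

# The acoustic pass misrecognizes "read"; the end-to-end pass
# scores its own token higher there, so fusion recovers it.
h1 = [("I", 0.9), ("red", 0.4), ("a", 0.8), ("book", 0.7)]
h2 = [("I", 0.8), ("read", 0.9), ("a", 0.7), ("book", 0.9)]
print(" ".join(fuse_hypotheses(h1, h2)))  # I read a book
```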
Based on the above embodiment, the acoustic model is obtained by performing distillation training on the original acoustic model based on the sample audio.
Specifically, an intelligent device that needs to perform local offline voice transcription is usually a handheld mobile device with relatively limited local processing capability, whereas the acoustic models commonly used in the cloud are large in scale and difficult to deploy directly on the device. The model size of the locally deployed acoustic model therefore needs to be reduced.
Here, the original acoustic model is an acoustic model that is larger in scale, more complex, and performs better on the task than the acoustic model in the speech transcription engine. To deploy a smaller acoustic model locally, knowledge transfer can be performed using the teacher-student network idea: knowledge in a teacher model is transferred to a student model to improve the student model's performance. The teacher model is the original acoustic model, the student model is the acoustic model finally deployed locally, and this knowledge transfer process is knowledge distillation. An acoustic model obtained through distillation training of the original acoustic model has performance close to that of the original acoustic model.
Before step 121 is executed, pre-training is performed to obtain the locally stored acoustic model. The specific training procedure may be as follows: first, a large amount of sample audio is collected and processed to obtain the sample acoustic features and sample transcription texts of the sample audio, and an original acoustic model is obtained, which may either be trained from the sample audio and its sample transcription texts or taken directly from an original acoustic model deployed in the cloud. Then, distillation training is performed on the basis of the sample audio and the first acoustic feature probability distribution that the original acoustic model outputs for the sample audio, yielding the trained acoustic model.
Data processing of the sample audio mainly means labeling it: the labeled sample audio is segmented at boundaries into short sentences, and each short sentence is windowed and framed to obtain the sample acoustic features of the sample audio. The sample acoustic features may be Filter Bank (FBank) features or MFCC features, which are not specifically limited in this embodiment of the present invention. After the sample acoustic features are determined, the labels of the sample audio are converted into the sample transcription texts corresponding to those features; the conversion may use a forced segmentation algorithm.
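The windowing-and-framing step mentioned above can be sketched as follows. The 25 ms window, 10 ms hop, and Hamming window are common defaults assumed here for illustration; the patent does not fix these values:

```python
import math

def frame_signal(samples, sr=16000, win_ms=25, hop_ms=10):
    """Split a waveform into overlapping, Hamming-windowed frames,
    the usual precursor to FBank/MFCC feature extraction."""
    win = int(sr * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)   # 160 samples at 16 kHz
    hamming = [0.54 - 0.46 * math.cos(2 * math.pi * i / (win - 1))
               for i in range(win)]
    frames = []
    for start in range(0, len(samples) - win + 1, hop):
        frames.append([s * w for s, w
                       in zip(samples[start:start + win], hamming)])
    return frames

one_second = [0.0] * 16000
print(len(frame_signal(one_second)))  # 98 frames
```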
According to the method provided by the embodiment of the invention, the acoustic model is obtained by carrying out distillation training on the original acoustic model, the model scale is compressed and the calculation amount is reduced while the accuracy of model transcription is ensured.
Based on the above embodiment, fig. 4 is a schematic diagram of a determination process of an acoustic model provided by the present invention, and as shown in fig. 4, the acoustic model is obtained based on the following training steps:
step 410, determining a first acoustic feature probability distribution of the sample audio based on the original acoustic model;
step 420, determining a second acoustic feature probability distribution of the sample audio based on the acoustic model in the training stage;
step 430, determining a distillation loss value based on the first acoustic characteristic probability distribution and the second acoustic characteristic probability distribution;
and 440, adjusting parameters of the acoustic model in the training stage based on the distillation loss value to obtain the acoustic model.
Specifically, before step 121 is executed, the acoustic model in the speech transcription engine may also be trained in advance, and the training process of the acoustic model is actually to adjust model parameters of the acoustic model according to distillation loss values of the acoustic model in the training stage, so as to obtain the acoustic model. The distillation loss value is determined based on a first acoustic characteristic probability distribution and a second acoustic characteristic probability distribution which are respectively output by the original acoustic model and the acoustic model.
Fig. 5 is a flow chart of training an acoustic model provided in the present invention, and as shown in fig. 5, in the training process of the acoustic model:
in step 410, the sample audio may be input to an original acoustic model, and the original acoustic model analyzes the input sample audio and outputs an acoustic feature probability distribution of the sample audio, which is recorded as a first acoustic feature probability distribution.
In step 420, the sample audio may be input into the acoustic model in the training stage, which analyzes the sample audio and outputs an acoustic feature probability distribution recorded as the second acoustic feature probability distribution. Both distributions reflect the probability or score that each audio frame of the sample audio belongs to the various acoustic states; they differ only in the model that produced them: the first comes from the original acoustic model in the teacher role, and the second from the training-stage acoustic model in the student role.
In step 430, after the acoustic feature probability distributions for the same sample audio are obtained from the original acoustic model and the acoustic model, the distillation loss value of the speech transcription task can be determined from the difference between the first acoustic feature probability distribution and the second acoustic feature probability distribution. For example, the distillation loss value can be expressed using the KLD (Kullback-Leibler divergence) criterion.
In step 440, the distillation loss value obtained in step 430 may be applied to the acoustic model in the training stage, that is, the acoustic model in the training stage is subjected to parameter adjustment, so as to obtain the acoustic model. It should be noted that the acoustic model in the training phase may be constructed according to the device computing resources.
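A from-scratch sketch of the KLD-based distillation loss of steps 410-440 follows; representing the frame-wise state distributions as lists of lists is an assumption made for illustration, not the patent's data layout:

```python
import math

def kld_distillation_loss(teacher_probs, student_probs, eps=1e-12):
    """Average Kullback-Leibler divergence between the teacher's
    (original model) and student's frame-wise acoustic-state
    distributions; each argument is a list of per-frame
    probability vectors over the same states."""
    total = 0.0
    for t_frame, s_frame in zip(teacher_probs, student_probs):
        total += sum(t * math.log((t + eps) / (s + eps))
                     for t, s in zip(t_frame, s_frame))
    return total / len(teacher_probs)

teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
drifted = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]
# A student matching the teacher incurs less loss than one that drifts.
print(kld_distillation_loss(teacher, teacher)
      < kld_distillation_loss(teacher, drifted))  # True
```

Minimizing this loss during training pulls the student's distribution toward the teacher's, which is what "adjusting parameters of the acoustic model based on the distillation loss value" amounts to.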
According to the method provided by the embodiment of the invention, an original acoustic model is introduced on top of the conventional acoustic model. A distillation loss value is determined from the difference between the first acoustic feature probability distribution output by the original acoustic model and the second acoustic feature probability distribution output by the training-stage acoustic model, and the parameters of the training-stage acoustic model are iterated according to this loss value to obtain the acoustic model. This improves the effect of voice transcription through the acoustic model while keeping the computation and model size of the acoustic model in the voice transcription engine as small as possible.
Based on the above embodiment, step 123 includes:
fusing the first transcription result and the second transcription result based on a text generation model in the voice transcription engine to obtain a transcription text of the audio to be transcribed;
the text generation model is trained by combining an acoustic model and a decoding model based on sample audio and sample transcription text of the sample audio.
Specifically, after the first transcription result and the second transcription result are obtained in steps 121 and 122, they may be fused by a text generation model in the speech transcription engine to improve the accuracy of the transcription result. Fig. 6 is a schematic flow chart of the fusion of transcription results provided by the present invention. As shown in fig. 6, the specific fusion process inputs the first and second transcription results into the text generation model in the speech transcription engine, which performs text error correction on them and generates the transcription text, finally outputting the transcription text of the audio to be transcribed.
Before the first transcription result and the second transcription result are fused by the text generation model in the voice transcription engine, the text generation model can be obtained through pre-training as follows: first, a large amount of sample audio is collected and processed to obtain the sample transcription texts of the sample audio; then the initial text generation model is trained jointly with the acoustic model and the decoding model on the sample audio and sample transcription texts. Specifically, during joint training, the acoustic model outputs acoustic features and a first transcription result for the sample audio, the decoding model outputs a second transcription result from those acoustic features, and the initial text generation model outputs a transcription text from the two results. A loss value for the initial text generation model is then determined from the sample transcription text of the sample audio and the transcription text output by the initial text generation model, and parameter iteration on the initial text generation model yields the trained text generation model.
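One joint-training step as described above can be outlined with hypothetical callables standing in for the three models, the loss function, and the optimizer; the names and signatures here are illustrative assumptions, not the patent's interfaces:

```python
def text_gen_training_step(sample_audio, sample_text,
                           acoustic, decoder, text_gen, loss_fn, opt):
    """One joint-training step for the text generation model:
    acoustic  -> (acoustic features, first transcription result)
    decoder   -> second transcription result from the features
    text_gen  -> fused transcription text from both results
    loss_fn   -> loss against the sample transcription text
    opt       -> parameter iteration driven by that loss."""
    feats, first = acoustic(sample_audio)
    second = decoder(feats)
    predicted = text_gen(first, second)
    loss = loss_fn(predicted, sample_text)
    opt(loss)
    return loss
```

With stub models, a perfectly fused prediction yields zero loss, illustrating the supervision signal the text generation model is trained on.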
According to the voice transcription method provided by the embodiment of the invention, a result fusion scheme is designed on top of the conventional scheme: the first transcription result output by the acoustic model and the second transcription result output by the decoding model are dynamically fused to obtain the final recognition result. This draws on the strengths of both the conventional acoustic-level voice transcription approach and the end-to-end approach, and the dynamic preference exercised by the text generation model in the voice transcription engine effectively improves the voice transcription effect and the accuracy of the voice transcription.
Fig. 7 is a general framework diagram of the voice transcription method provided by the present invention, and as shown in fig. 7, the method includes a training phase and an application phase.
The training phase includes training of the acoustic model, the decoding model, and the text generation model, which has been described in detail above and will not be described herein.
The application stage comprises the following steps:
step 710, determining audio to be transcribed;
step 720, determining the resource occupation ratio of the voice transcription engine based on the data specification of the audio to be transcribed, the energy consumption state of the equipment and/or the heating state of the equipment, or based on the data specification of the audio to be transcribed;
step 730, determining the operation resources of the voice transcription engine from the equipment calculation resources based on the resource proportion;
step 740, based on the acoustic model in the speech transcription engine, encoding the acoustic features of the audio to be transcribed, and determining a first transcription result based on the acoustic features;
step 750, decoding the acoustic features based on a decoding model in the speech transcription engine to obtain a second transcription result;
and 760, fusing the first transcription result and the second transcription result based on the voice transcription engine to obtain the transcription text of the audio to be transcribed.
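The application stage (steps 710-760) can be outlined as plain sequential control flow. Here `engine` is a hypothetical object bundling the acoustic, decoding, and text generation models, and the length-based data specification is a simplifying assumption:

```python
def transcribe_offline(audio, device_resources, storage_capacity, engine):
    """Steps 710-760 as sequential control flow (illustrative only)."""
    ratio = min(len(audio) / storage_capacity, 1.0)  # 710-720: data spec -> ratio
    resources = device_resources * ratio             # 730: engine compute share
    feats, first = engine.encode(audio)              # 740: acoustic model pass
    second = engine.decode(feats)                    # 750: decoding model pass
    return engine.fuse(first, second), resources     # 760: fused transcription
```

In the patented method the resource ratio may additionally account for the device's energy consumption and heating states (step 720), which this sketch omits.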
The following describes the voice transcription apparatus provided by the present invention, and the voice transcription apparatus described below and the voice transcription method described above may be referred to correspondingly.
Fig. 8 is a schematic structural diagram of a voice transcription apparatus provided by the present invention, and as shown in fig. 8, the apparatus includes:
an audio determining unit 810 for determining an audio to be transcribed;
a voice transcription unit 820, configured to perform offline voice transcription on the audio to be transcribed based on a locally stored voice transcription engine, where an operation resource of the voice transcription engine is determined based on a data specification of the audio to be transcribed.
The voice transcription device provided by the invention performs offline voice transcription through a locally stored voice transcription engine, achieving stronger real-time performance and higher security without depending on the network. During offline voice transcription, the computing resources of the voice transcription engine are determined from the data specification of the audio to be transcribed, taking into account both the transcription efficiency of the engine under different loads and the power consumption of the intelligent device. When the engine invokes the allocated computing resources to transcribe the audio to be transcribed, the device can therefore balance power consumption against transcription efficiency, reducing power consumption while guaranteeing transcription efficiency and alleviating severe heating and poor endurance.
Based on the above embodiment, the apparatus further includes an operation resource determining unit, configured to:
determining the resource occupation ratio of the voice transcription engine based on the data specification of the audio to be transcribed, and the energy consumption state and/or the heating state of the equipment, or based on the data specification of the audio to be transcribed;
determining computational resources of the speech transcription engine from device computational resources based on the resource occupancy.
Based on the above embodiment, the data specification is determined based on the data amount of the audio to be transcribed and the storage amount of the storage space, where the storage space is used for storing the audio to be transcribed.
Based on the above embodiment, the voice transcription unit 820 is configured to:
coding the acoustic characteristics of the audio to be transcribed based on an acoustic model in the voice transcription engine, and determining a first transcription result based on the acoustic characteristics;
decoding the acoustic features based on a decoding model in the voice transcription engine to obtain a second transcription result;
and fusing the first transcription result and the second transcription result based on the voice transcription engine to obtain the transcription text of the audio to be transcribed.
Based on the above embodiment, the acoustic model is obtained by performing distillation training on the original acoustic model based on the sample audio.
Based on the above embodiment, the apparatus further includes an acoustic model determination unit, configured to:
determining a first acoustic feature probability distribution of the sample audio based on the original acoustic model;
determining a second acoustic feature probability distribution of the sample audio based on an acoustic model of a training phase;
determining a distillation loss value based on the first acoustic feature probability distribution and the second acoustic feature probability distribution;
and adjusting parameters of the acoustic model in the training stage based on the distillation loss value to obtain the acoustic model.
Based on the above embodiment, the voice transcription unit 820 is configured to:
fusing the first transcription result and the second transcription result based on a text generation model in the voice transcription engine to obtain a transcription text of the audio to be transcribed;
the text generation model is trained by combining the acoustic model and the decoding model based on sample audio and sample transcription text of the sample audio.
Fig. 9 is a schematic structural diagram of a sound recording apparatus provided by the present invention, and as shown in fig. 9, the apparatus includes a sound pickup apparatus 910 and an electronic apparatus 1000, where the sound pickup apparatus 910 is used for recording audio to be transcribed.
Fig. 10 illustrates a physical structure diagram of an electronic device, and as shown in fig. 10, the electronic device may include: a processor (processor)1010, a communication Interface (Communications Interface)1020, a memory (memory)1030, and a communication bus 1040, wherein the processor 1010, the communication Interface 1020, and the memory 1030 communicate with each other via the communication bus 1040. Processor 1010 may invoke logic instructions in memory 1030 to perform a method of voice transcription comprising: determining audio to be transcribed; and performing offline voice transcription on the audio to be transcribed based on a locally stored voice transcription engine, wherein the operation resource of the voice transcription engine is determined based on the data specification of the audio to be transcribed.
Furthermore, the logic instructions in the memory 1030 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product, the computer program product including a computer program, the computer program being storable on a non-transitory computer-readable storage medium, the computer program being capable of executing, when executed by a processor, a voice transcription method provided by the above methods, the method including: determining audio to be transcribed; and performing offline voice transcription on the audio to be transcribed based on a locally stored voice transcription engine, wherein the operation resource of the voice transcription engine is determined based on the data specification of the audio to be transcribed.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor, implements a method of voice transcription provided by the above methods, the method comprising: determining audio to be transcribed; and performing offline voice transcription on the audio to be transcribed based on a locally stored voice transcription engine, wherein the operation resource of the voice transcription engine is determined based on the data specification of the audio to be transcribed.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of voice transcription, comprising:
determining audio to be transcribed;
and performing offline voice transcription on the audio to be transcribed based on a locally stored voice transcription engine, wherein the operation resource of the voice transcription engine is determined based on the data specification of the audio to be transcribed.
2. The speech transcription method according to claim 1, wherein the computational resources of the speech transcription engine are determined based on the steps of:
determining the resource occupation ratio of the voice transcription engine based on the data specification of the audio to be transcribed, and the energy consumption state and/or the heating state of the equipment, or based on the data specification of the audio to be transcribed;
determining computational resources of the speech transcription engine from device computational resources based on the resource occupancy.
3. The voice transcription method according to claim 1 or 2, characterized in that the data specification is determined based on the data amount of the audio to be transcribed and the storage amount of a storage space for storing the audio to be transcribed.
4. The voice transcription method according to claim 1 or 2, wherein the performing offline voice transcription on the audio to be transcribed based on the locally stored voice transcription engine comprises:
coding the acoustic characteristics of the audio to be transcribed based on an acoustic model in the voice transcription engine, and determining a first transcription result based on the acoustic characteristics;
decoding the acoustic features based on a decoding model in the voice transcription engine to obtain a second transcription result;
and fusing the first transcription result and the second transcription result based on the voice transcription engine to obtain the transcription text of the audio to be transcribed.
5. The method of claim 4, wherein the acoustic model is derived from distillation training of an original acoustic model based on the sample audio.
6. The method of claim 5, wherein the acoustic model is trained based on the following steps:
determining a first acoustic feature probability distribution of the sample audio based on the original acoustic model;
determining a second acoustic feature probability distribution of the sample audio based on an acoustic model of a training phase;
determining a distillation loss value based on the first acoustic feature probability distribution and the second acoustic feature probability distribution;
and adjusting parameters of the acoustic model in the training stage based on the distillation loss value to obtain the acoustic model.
7. The method according to claim 4, wherein the fusing the first transcription result and the second transcription result based on the speech transcription engine to obtain the transcribed text of the audio to be transcribed comprises:
fusing the first transcription result and the second transcription result based on a text generation model in the voice transcription engine to obtain a transcription text of the audio to be transcribed;
the text generation model is trained by combining the acoustic model and the decoding model based on sample audio and sample transcription text of the sample audio.
8. A speech transcription device, comprising:
an audio determining unit for determining an audio to be transcribed;
and the voice transcription unit is used for performing off-line voice transcription on the audio to be transcribed based on a locally stored voice transcription engine, and the operation resource of the voice transcription engine is determined based on the data specification of the audio to be transcribed.
9. Recording device comprising a pickup device, a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the pickup device is adapted to record audio to be transcribed, and in that the processor, when executing the program, carries out the steps of the method for transcribing speech according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the speech transcription method as claimed in any one of claims 1 to 7.
CN202110963345.7A 2021-08-20 2021-08-20 Voice transcription method, device, recording equipment and storage medium Pending CN113611311A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110963345.7A CN113611311A (en) 2021-08-20 2021-08-20 Voice transcription method, device, recording equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113611311A true CN113611311A (en) 2021-11-05

Family

ID=78341564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110963345.7A Pending CN113611311A (en) 2021-08-20 2021-08-20 Voice transcription method, device, recording equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113611311A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9224387B1 (en) * 2012-12-04 2015-12-29 Amazon Technologies, Inc. Targeted detection of regions in speech processing data streams
CN108766434A (en) * 2018-05-11 2018-11-06 东北大学 A kind of Sign Language Recognition translation system and method
CN112002303A (en) * 2020-07-23 2020-11-27 云知声智能科技股份有限公司 End-to-end speech synthesis training method and system based on knowledge distillation
CN112035247A (en) * 2020-08-12 2020-12-04 博泰车联网(南京)有限公司 Resource scheduling method, vehicle machine and computer storage medium
CN112562688A (en) * 2020-12-11 2021-03-26 天津讯飞极智科技有限公司 Voice transcription method, device, recording pen and storage medium
CN112581965A (en) * 2020-12-11 2021-03-30 天津讯飞极智科技有限公司 Transcription method, device, recording pen and storage medium
CN112634902A (en) * 2020-12-11 2021-04-09 天津讯飞极智科技有限公司 Voice transcription method, device, recording pen and storage medium
CN112908359A (en) * 2021-01-31 2021-06-04 云知声智能科技股份有限公司 Voice evaluation method and device, electronic equipment and computer readable medium
CN113129870A (en) * 2021-03-23 2021-07-16 北京百度网讯科技有限公司 Training method, device, equipment and storage medium of speech recognition model

Similar Documents

Publication Publication Date Title
WO2016192410A1 (en) Method and apparatus for audio signal enhancement
WO2021147237A1 (en) Voice signal processing method and apparatus, and electronic device and storage medium
CN105845128A (en) Voice identification efficiency optimization method based on dynamic pruning beam prediction
US12046249B2 (en) Bandwidth extension of incoming data using neural networks
CN107316635B (en) Voice recognition method and device, storage medium and electronic equipment
WO2023216760A1 (en) Speech processing method and apparatus, and storage medium, computer device and program product
CN102376306B (en) Method and device for acquiring level of speech frame
CN112634902A (en) Voice transcription method, device, recording pen and storage medium
CN114783409A (en) Training method of speech synthesis model, speech synthesis method and device
US20230410794A1 (en) Audio recognition method, method of training audio recognition model, and electronic device
US20180082703A1 (en) Suitability score based on attribute scores
US11017790B2 (en) Avoiding speech collisions among participants during teleconferences
CN113611311A (en) Voice transcription method, device, recording equipment and storage medium
CN112634912A (en) Packet loss compensation method and device
CN111951821A (en) Call method and device
Li et al. A mapping model of spectral tilt in normal-to-Lombard speech conversion for intelligibility enhancement
WO2022012215A1 (en) Method, apparatus and device for identifying speaking object, and readable storage medium
CN114743540A (en) Speech recognition method, system, electronic device and storage medium
CN110930985B (en) Telephone voice recognition model, method, system, equipment and medium
JP2023517973A (en) Speech encoding method, apparatus, computer equipment and computer program
CN114783428A (en) Voice translation method, voice translation device, voice translation model training method, voice translation model training device, voice translation equipment and storage medium
CN113689866A (en) Training method and device of voice conversion model, electronic equipment and medium
US20230075562A1 (en) Audio Transcoding Method and Apparatus, Audio Transcoder, Device, and Storage Medium
Grassucci et al. Enhancing Semantic Communication with Deep Generative Models: An Overview
Kurpukdee et al. Enhance run-time performance with a collaborative distributed speech recognition framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination