CN114049885B - Punctuation mark recognition model construction method and punctuation mark recognition model construction device - Google Patents

Punctuation mark recognition model construction method and punctuation mark recognition model construction device

Info

Publication number
CN114049885B
Authority
CN
China
Prior art keywords
text
voice
processing module
data
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210030614.9A
Other languages
Chinese (zh)
Other versions
CN114049885A (en)
Inventor
陈梦喆
陈谦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Damo Institute Hangzhou Technology Co Ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd filed Critical Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority to CN202210030614.9A priority Critical patent/CN114049885B/en
Publication of CN114049885A publication Critical patent/CN114049885A/en
Application granted granted Critical
Publication of CN114049885B publication Critical patent/CN114049885B/en
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26: Speech to text systems


Abstract

The application discloses a punctuation mark recognition model construction method, a punctuation mark recognition model construction device and punctuation mark recognition equipment. The method comprises the following steps: acquiring a first text set and a first voice data set, and acquiring a corresponding relation set between second voice data and second texts; learning, according to the first text set, the network parameters of a text processing module included in the model; learning, according to the first voice data set, the first network parameters of a voice processing module included in the model; and training the voice processing module based on the first network parameters according to the corresponding relation set to obtain the second network parameters of the voice processing module. With this processing approach, the model attains more consistent recognition accuracy across general domains; at the same time, the voice processing module is better learned from a small amount of parallel data covering fewer domains, so that after acoustic information is introduced the speaker's intention can be better exploited and punctuation marks that better match spoken language are obtained.

Description

Punctuation mark recognition model construction method and punctuation mark recognition model construction device
Technical Field
The application relates to the technical field of voice processing, and in particular to a punctuation mark recognition model construction method, a punctuation mark recognition model construction device, an electronic device, a voice transcription system and a voice interaction system.
Background
A voice transcription system is a voice processing system that transcribes speech into text. Such a system can automatically produce a conference summary, improving conference efficiency, avoiding waste of manpower, material and financial resources, and reducing conference cost.
For ease of reading, the text output by a real-time voice transcription system usually carries punctuation marks. Spoken punctuation prediction is the task of determining the punctuation of a voice transcription text. In a typical method, a pre-trained spoken punctuation recognition model jointly considers the voice transcription text and the corresponding acoustic features of the speech to predict the punctuation that may appear in the text. The corpus required to train such a spoken punctuation recognition model must carry both audio and text annotations.
However, in the course of implementing the invention, the inventor found that this technical scheme has at least the following problem: the domain coverage of labeled parallel data is far smaller than that of plain-text data, and training the model with a small amount of parallel data from limited domains yields good spoken punctuation recognition only in some domains. In summary, how to train a model with a small amount of parallel data of limited domain coverage, such that the model shows a consistent improvement across general domains, has become a problem urgently awaiting solution by those skilled in the art.
Disclosure of Invention
The application provides a punctuation mark recognition model construction method, aiming to solve the prior-art problem that the model attains high recognition accuracy only in the domains covered by the parallel corpus. The application additionally provides a punctuation mark recognition model construction device, an electronic device, a voice transcription system and a voice interaction system.
The application provides a punctuation mark recognition model construction method, which comprises the following steps:
acquiring a first text set and a first voice data set, and acquiring a corresponding relation set between second voice data and a second text;
according to the first text set, learning to obtain network parameters of a text processing module included in the model; according to a first voice data set, learning to obtain a first network parameter of a voice processing module included by the model;
and training the voice processing module based on the first network parameters according to the corresponding relation set to obtain second network parameters of the voice processing module.
Optionally, the first text set and the first voice data set include text and voice data in a first domain and/or a first language, the corresponding relation set includes text and voice data in a second domain and/or a second language, and the model is used to recognize punctuation marks of text transcribed from speech in the first domain and/or the first language.
Optionally, the text processing module includes a plurality of text feature extraction layers;
the input data of the text feature extraction layer comprises: the text features output by the last text feature extraction layer and the acoustic features output by the voice processing module.
Optionally, the second voice data includes voice data containing noise;
the voice processing module comprises: the device comprises an acoustic feature extraction module, an audio quality detection module and an acoustic feature adjustment module;
the audio quality detection module is used for acquiring audio quality data of the second voice data;
and the acoustic feature adjusting module is used for adjusting the acoustic features output by the acoustic feature extracting module according to the audio quality data, and taking the adjusted acoustic features as the input data of the corresponding text feature extracting layer.
Optionally, the voice processing module further includes: the acoustic feature conversion layers respectively correspond to the text feature extraction layers;
and the acoustic feature conversion layer is used for performing feature conversion on the adjusted acoustic features, and taking the converted acoustic features as input data of the corresponding text feature extraction layer.
Optionally, the audio quality detection module includes: the time-frequency characteristic extraction module and the audio quality determination module;
the time-frequency characteristic extraction module is used for extracting time-frequency characteristics from the second voice data;
and the audio quality determining module is used for acquiring the audio quality data according to the time-frequency characteristics.
Optionally, the voice processing module includes: the acoustic feature extraction module is used for respectively corresponding to the acoustic feature conversion layers of the text feature extraction layers;
and the acoustic feature conversion layer is used for performing feature conversion on the acoustic features output by the acoustic feature extraction module and taking the converted acoustic features as input data of the corresponding text feature extraction layer.
Optionally, the learning to obtain the network parameters of the text processing module included in the model according to the first text set includes:
removing punctuation marks in the first text;
taking the first text without punctuation marks as input data of a text processing module, and predicting the punctuation marks of the input text through the text processing module;
and adjusting the network parameters of the text processing module according to the difference between the predicted punctuation marks and the punctuation mark marking information of the first text.
Optionally, the learning to obtain the first network parameters of the voice processing module included in the model according to the first voice data set includes:
learning, in a self-learning manner according to the first voice data set, the first network parameters of the voice processing module included in the model.
Optionally, the training the voice processing module based on the first network parameter according to the corresponding relationship set to obtain a second network parameter of the voice processing module includes:
taking the second text without punctuation marks as input data of a text processing module, taking second voice data corresponding to the second text as input data of a voice processing module, and predicting the punctuation marks of the input text through the model;
and adjusting the network parameters of the voice processing module according to the difference between the predicted punctuation marks and the punctuation mark marking information of the second text.
The present application further provides a punctuation mark recognition model construction device, including:
the data acquisition unit is used for acquiring a first text set and a first voice data set and a corresponding relation set between second voice data and a second text;
the pre-training unit is used for learning to obtain network parameters of the text processing module included by the model according to the first text set; according to a first voice data set, learning to obtain a first network parameter of a voice processing module included by the model;
and the tuning unit is used for training the voice processing module based on the first network parameter according to the corresponding relation set to obtain a second network parameter of the voice processing module.
The present application further provides an electronic device, comprising:
a voice acquisition device;
a speaker;
a processor; and
and a memory for storing a program implementing the voice interaction method; after the device is powered on, the program of the voice interaction method is run through the processor.
The present application further provides a voice transcription system, comprising:
the server is used for receiving conference voice data sent by the conference terminal; acquiring a voice transcription text of the voice data; recognizing punctuation information of the voice transcription text according to the voice data and the voice transcription text through a punctuation recognition model; returning a voice transcription text with punctuation marks to the conference terminal;
the conference terminal is used for acquiring the voice data and displaying the voice transcription text with the punctuations;
wherein the model is constructed in the following way: acquiring a first text set and a first voice data set, and acquiring a corresponding relation set between second voice data and a second text; according to the first text set, learning to obtain network parameters of a text processing module included in the model; according to a first voice data set, learning to obtain a first network parameter of a voice processing module included by the model; and training the voice processing module based on the first network parameters according to the corresponding relation set to obtain second network parameters of the voice processing module.
The present application further provides a voice interaction system, comprising:
the server is used for receiving a voice interaction request aiming at target voice data sent by the client; acquiring a voice transcription text of the voice data; recognizing punctuation information of the voice transcription text according to the voice data and the voice transcription text through a punctuation recognition model; determining voice reply information and/or voice instruction information according to the voice transcription text with punctuations; returning the voice reply information to the client;
the client is used for acquiring the target voice data; displaying the voice reply information and/or executing voice instruction information;
wherein the model is constructed in the following way: acquiring a first text set and a first voice data set, and acquiring a corresponding relation set between second voice data and a second text; according to the first text set, learning to obtain network parameters of a text processing module included in the model; according to a first voice data set, learning to obtain a first network parameter of a voice processing module included by the model; and training the voice processing module based on the first network parameters according to the corresponding relation set to obtain second network parameters of the voice processing module.
The present application also provides a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform the various methods described above.
The present application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the various methods described above.
Compared with the prior art, the method has the following advantages:
according to the construction method of the punctuation recognition model, a large amount of easily-obtained single data (a first text set and a first voice data set) covering more fields are used for pre-training the model to enable the model to obtain a better initial effect, then network parameter values of a text processing module are fixed, and parallel data (a corresponding relation set between second voice data and second text) covering less fields are used for fine-tuning network parameters of the voice processing module. Therefore, the effect of the model on the main signal source 'a large amount of texts covering more fields' is not changed essentially, so that the model has more consistent recognition accuracy in the general field. Meanwhile, the voice processing module is better learned from a small amount of parallel data covering less fields, the intention of a speaker can be better utilized after acoustic information is introduced, and punctuation marks more conforming to spoken language are obtained, so that the model can achieve higher punctuation recognition accuracy rate for more fields, and the model can have consistent recognition accuracy rate improvement in the general field.
According to the punctuation mark recognition method provided by the embodiments of the application, the punctuation marks of a voice transcription text are recognized through the above punctuation mark recognition model, so that punctuation is recognized well even when the text belongs to a domain different from that of the parallel data used in the model training stage.
According to the voice transcription system provided by the embodiments of the application, punctuation mark recognition is performed on the transcription text of conference voice through the punctuation mark recognition model, so that punctuation recognition accuracy remains high even when the conference voice belongs to a domain different from that of the parallel data used in the model training stage.
According to the voice interaction system provided by the embodiments of the application, punctuation mark recognition is performed on the transcription text of interactive voice through the punctuation mark recognition model, so that punctuation recognition accuracy remains high even when the interactive voice belongs to a domain different from that of the parallel data used in the model training stage, yielding a better voice interaction effect.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a punctuation mark recognition model construction method provided by the present application;
FIG. 2 is a schematic model diagram of an embodiment of a punctuation mark recognition model construction method provided by the present application;
FIG. 3 is a schematic diagram of another model of an embodiment of a punctuation mark recognition model construction method provided by the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application can, however, be implemented in many ways different from those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; the application is therefore not limited to the specific implementations disclosed below.
The application provides a punctuation mark recognition model construction method and device, electronic equipment, a voice transcription system and a voice interaction system. Each of the schemes is described in detail in the following examples.
First embodiment
Please refer to fig. 1, which is a flowchart illustrating an embodiment of the punctuation mark recognition model construction method provided by the present application. The execution subject of the method is a punctuation mark recognition model construction device, which is usually deployed on a server but is not limited to it, and may be any device capable of implementing the method. The punctuation mark recognition model construction method provided by this embodiment comprises the following steps:
step S101: and acquiring a first text set and a first voice data set and a corresponding relation set between second voice data and second text.
The first text set comprises a plurality of first texts with punctuation marks, for example: "Patients cannot afford scarce drugs, and illegal market manipulation drives prices up…… In recent years, the problem of domestic drug supply shortages has drawn constant attention. Recently, ……". The first voice data set includes a plurality of first voice data; the first voice data may be voice data collected through a microphone. The first texts and the first voice data have no correspondence with each other; both are single (unpaired) data.
The corresponding relation set includes correspondences between a plurality of second voice data and second texts, each correspondence being referred to as a pair of parallel data. The second text is the punctuated text annotation of the second voice data, and is usually obtained by manually annotating the second voice data.
The first text set and the first voice data set can cover essentially unrestricted domains and languages, and large amounts of first texts and first voice data can be obtained for them. The domains and languages covered by the corresponding relation set are usually limited, and only a small number of correspondences can be obtained.
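For concreteness, the three data sets can be organized as in the following minimal sketch (PyTorch-style Python; the container and field names are illustrative assumptions, not part of the application):

    from dataclasses import dataclass
    from typing import List

    import torch

    @dataclass
    class ParallelExample:
        """One correspondence: second voice data plus its punctuated second text."""
        waveform: torch.Tensor  # raw audio samples, shape (num_samples,)
        text: str               # manually annotated, punctuated transcription

    # First text set: a large amount of punctuated text, no audio counterpart.
    first_text_set: List[str] = ["近年来，国内药品供应短缺问题一直备受关注。"]

    # First voice data set: a large amount of audio, no transcription needed.
    first_voice_set: List[torch.Tensor] = [torch.randn(16000 * 5)]  # 5 s at 16 kHz

    # Corresponding relation set: a small amount of parallel data, fewer domains.
    correspondence_set: List[ParallelExample] = [
        ParallelExample(waveform=torch.randn(16000 * 3), text="会议现在开始，请就座。"),
    ]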
In one example, the first text set and the first voice data set may include text and voice data, respectively, of a first domain, the corresponding relation set includes text and voice data of a second domain, and the model may be used to recognize punctuation marks of voice-transcribed text of the first domain and the second domain. In specific implementation, the first text set may also include text of the second domain, i.e. the first text set covers multi-domain text. The first voice data set may or may not include voice data of the second domain. The texts and voice data in the first text set and the first voice data set need not be parallel data. For example, if the first text set and the first voice data set comprise texts and voice data of computer conferences and government conferences, and the corresponding relation set comprises only voice data and corresponding texts of government conferences, a model trained on these three data sets can attain relatively consistent recognition accuracy when recognizing punctuation marks of both government conference content and computer conference content.
In another example, the first text set and the first voice data set may include text and voice data, respectively, in a first language, the corresponding relation set includes text and voice data in a second language, and the model may be used to recognize punctuation marks of voice-transcribed text in the first language and the second language. In specific implementation, the first text set may also include text in the second language, i.e. the first text set covers multi-language text. The first voice data set may or may not include voice data in the second language. The texts and voice data in the first text set and the first voice data set need not be parallel data. For example, if the first text set and the first voice data set comprise texts and voice data of English conferences and Chinese conferences, and the corresponding relation set comprises only the voice data and corresponding texts of Chinese conferences, a model trained on these three data sets can attain relatively consistent recognition accuracy when recognizing punctuation marks of both English conference content and Chinese conference content.
In yet another example, the first text set and the first voice data set may include text and voice data in a first domain and a first language, respectively, the corresponding relation set includes text and voice data in a second domain and a second language, and the model may be used to recognize punctuation marks of voice-transcribed text in the first domain and the first language. The first text set and the first voice data set may also include text and voice data in the second domain and the second language, but these need not be parallel data. For example, if the first text set and the first voice data set comprise texts and voice data of English computer conferences and Chinese government conferences, and the corresponding relation set comprises the voice data and corresponding texts of Chinese government conferences, a model trained on these three data sets can attain relatively consistent recognition accuracy when recognizing punctuation marks of both English computer conference content and Chinese government conference content.
Step S103: learn, according to the first text set, the network parameters of the text processing module included in the model; and learn, according to the first voice data set, the first network parameters of the voice processing module included in the model.
The model predicts the punctuation marks appearing in a text according to the voice data and the voice transcription text. The model includes a voice processing module and a text processing module. The input data of the text processing module is text without punctuation marks, and the input data of the voice processing module is voice data. The voice processing module extracts and outputs high-order acoustic features through multiple feature extraction layers. The text processing module extracts text features through multiple feature extraction layers while also taking in the high-order acoustic features output by the voice processing module, and predicts the punctuation included in the text based on both the text features and the high-order acoustic features.
In the course of implementing the present invention, the inventor found that the common prior-art way of introducing sound information is to adjust the model parameters globally, which causes the model to fit the domains and language style covered by the parallel data. To solve this problem, the method provided by this embodiment trains the model in two steps: the first step, step S103, trains the model globally based on a large amount of single data (the first text set and the first voice data set) covering more domains; the second step, step S105, tunes the voice processing module based on a small amount of parallel data covering fewer domains. In this way the model can effectively obtain the benefit brought by the audio information, while its behavior on the main signal source, text, undergoes no substantial change, so the model does not fit the domains and language style covered by the parallel data.
Step S103 pre-trains the model so that it obtains a good initial effect. In the pre-training stage, the network parameters of the text processing module and those of the voice processing module are each trained on their own single-modality data.
In one example, learning the network parameters of the text processing module from the first text set can be implemented as follows: remove the punctuation marks in a first text; feed the punctuation-free first text to the text processing module as input data; predict the punctuation of the input text with the text processing module; compare the predicted punctuation with the punctuation annotation of the first text and compute the model loss; and adjust the network parameters of the text processing module according to the model loss until the optimization target of the model is reached, thereby obtaining the network parameters of the text processing module.
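A minimal sketch of this pre-training step (PyTorch; the character-level tokenization, the punctuation label set and the text-module interface are assumptions made for illustration, the application does not fix them):

    import torch
    import torch.nn.functional as F

    PUNCT = {"，": 1, "。": 2, "？": 3}  # assumed label set; 0 = no punctuation

    def strip_punctuation(text):
        """Remove punctuation and emit one label per remaining token: the
        label of a token is the mark that followed it (0 if none)."""
        tokens, labels = [], []
        for ch in text:
            if ch in PUNCT and tokens:
                labels[-1] = PUNCT[ch]
            elif ch not in PUNCT:
                tokens.append(ch)  # character-level tokens, an assumption
                labels.append(0)
        return tokens, labels

    def pretrain_text_module(text_module, vocab, first_text_set, optimizer):
        for text in first_text_set:
            tokens, labels = strip_punctuation(text)
            ids = torch.tensor([[vocab[t] for t in tokens]])        # (1, T)
            target = torch.tensor([labels])                         # (1, T)
            logits = text_module(ids)                               # (1, T, num_labels)
            loss = F.cross_entropy(logits.transpose(1, 2), target)  # model loss
            optimizer.zero_grad()
            loss.backward()  # adjust the network parameters of the text module
            optimizer.step()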
In one example, learning the first network parameters of the voice processing module from the first voice data set can be implemented as follows: learn, in a self-learning manner according to the first voice data set, the first network parameters of the voice processing module included in the model. The self-learning approach may perform the modeling task by predicting masked portions of the speech: a certain proportion of the audio frames of the input voice data are covered, and the goal is to reproduce the frames at those positions. With this annotation-free self-learning method, a better acoustic representation can be obtained.
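The masked-frame objective could look roughly as follows (a sketch; wav2vec-style models actually use a contrastive loss, and a plain L2 frame-reconstruction loss with zero-masking stands in here as an assumed simplification):

    import torch
    import torch.nn.functional as F

    def pretrain_speech_module(speech_module, voice_frame_sets, optimizer,
                               mask_ratio=0.15):
        """Cover a proportion of the input audio frames and train the module
        to reproduce the frames at the covered positions. Assumes the module
        carries a reconstruction head mapping back to feat_dim for this task."""
        for frames in voice_frame_sets:       # frames: (T, feat_dim)
            mask = torch.rand(frames.size(0)) < mask_ratio
            if not mask.any():
                continue
            masked = frames.clone()
            masked[mask] = 0.0                # cover the selected frames
            recon = speech_module(masked.unsqueeze(0)).squeeze(0)
            loss = F.mse_loss(recon[mask], frames[mask])  # reproduce masked frames
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()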
Step S105: train the voice processing module based on the first network parameters according to the corresponding relation set, to obtain the second network parameters of the voice processing module.
Step S105 performs tuning training on the voice processing module. At this stage, the network parameter values of the text processing module can be fixed, and the network parameters of the voice processing module are fine-tuned with the parallel data (the corresponding relation set) covering fewer domains.
In one example, step S105 can be implemented as follows: 1) take the second text with punctuation removed as input data of the text processing module, take the second voice data corresponding to the second text as input data of the voice processing module, and predict the punctuation of the input text with the model; 2) adjust the network parameters of the voice processing module according to the difference between the predicted punctuation and the punctuation annotation of the second text. In specific implementation, the predicted punctuation can be compared with the punctuation annotation of the second text to compute the model loss, and the network parameters of the voice processing module are adjusted according to the model loss until the optimization target of the model is reached, thereby obtaining the second network parameters of the voice processing module.
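The defining step of this stage, fixing the text processing module and updating only the voice processing module, is direct to express (a sketch reusing strip_punctuation from the pre-training sketch above; align_features is a hypothetical helper that turns a waveform into token-aligned acoustic features, see the alignment note below):

    import torch
    import torch.nn.functional as F

    def tune_speech_module(model, correspondence_set, vocab, lr=1e-4):
        # Fix the network parameter values of the text processing module.
        for p in model.text_module.parameters():
            p.requires_grad = False
        optimizer = torch.optim.Adam(model.speech_module.parameters(), lr=lr)

        for pair in correspondence_set:
            tokens, labels = strip_punctuation(pair.text)
            ids = torch.tensor([[vocab[t] for t in tokens]])
            target = torch.tensor([labels])
            # align_features: hypothetical helper producing token-aligned features.
            feats = align_features(pair.waveform, num_tokens=ids.size(1))
            logits = model(ids, feats.unsqueeze(0))   # text + voice input
            loss = F.cross_entropy(logits.transpose(1, 2), target)
            optimizer.zero_grad()
            loss.backward()   # gradients flow only into the voice module
            optimizer.step()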
In one example, the text processing module may include a plurality of text feature extraction layers, where the input data of each text feature extraction layer comprises the text features output by the previous text feature extraction layer and the acoustic features output by the voice processing module. Text and voice are thus fused at every layer of the model, making the fusion fuller and effectively improving recognition accuracy.
As shown in fig. 2, the text processing module includes an embedding layer and 4 Transformer layers, and the voice processing module includes a linear conversion layer and multiple Transformer layers. In the text processing module, the input data of the embedding layer is text without punctuation marks, and its output O1 is the word vectors of the text; O1 then passes through the 4 Transformer layers, whose outputs are O2, O3, O4 and O5 respectively. O5 is the text feature output by the text processing module, a text representation conducive to punctuation decisions.
In the voice processing module, the input data of the linear conversion layer is a preliminary voice-signal representation. This representation may consist of physical quantities characterizing the acoustics of the speech (e.g. pause information, pitch, energy, spectral features) extracted from the voice data by an audio feature extraction module, or of acoustic features derived from the voice data and included directly in the training data. The audio feature extraction module may itself be implemented as a multi-layer model, in which case the input of the linear conversion layer is the high-order information extracted by that model (e.g. wav2vec features). After linear conversion, the preliminary voice-signal representation passes through multiple Transformer layers, which extract features and abstract it into a high-order acoustic representation S. As shown in fig. 2, the vector S output by the voice processing module is added into each text feature extraction layer of the text processing module as part of that layer's input data.
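A structural sketch matching this description (embedding layer plus 4 Transformer layers on the text side, linear conversion layer plus Transformer layers on the voice side, with S added into the input of every text layer; all dimensions, head counts and the classifier head are illustrative assumptions):

    import torch
    import torch.nn as nn

    class SpeechModule(nn.Module):
        """Voice processing module: linear conversion layer + Transformer layers."""
        def __init__(self, feat_dim=80, d_model=256, n_layers=4):
            super().__init__()
            self.linear_in = nn.Linear(feat_dim, d_model)
            self.layers = nn.ModuleList(
                [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
                 for _ in range(n_layers)])

        def forward(self, feats):               # feats: (B, T, feat_dim)
            s = self.linear_in(feats)
            for layer in self.layers:
                s = layer(s)
            return s                            # high-order acoustic representation S

    class TextModule(nn.Module):
        """Text processing module: embedding layer + 4 Transformer layers,
        with S added into the input of every layer."""
        def __init__(self, vocab_size, d_model=256, num_labels=4, n_layers=4):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, d_model)
            self.layers = nn.ModuleList(
                [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
                 for _ in range(n_layers)])
            self.classifier = nn.Linear(d_model, num_labels)

        def forward(self, token_ids, s=None):
            o = self.embedding(token_ids)       # O1: word vectors of the text
            for layer in self.layers:
                o = layer(o if s is None else o + s)  # fuse text and voice per layer
            return self.classifier(o)           # per-token punctuation logits

    class PunctuationModel(nn.Module):
        def __init__(self, vocab_size):
            super().__init__()
            self.speech_module = SpeechModule()
            self.text_module = TextModule(vocab_size)

        def forward(self, token_ids, speech_feats):
            # speech_feats are assumed already token-aligned, see below.
            return self.text_module(token_ids, self.speech_module(speech_feats))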
In specific implementation, since the modeling units of voice data and text differ in number (the former are speech frames, the latter are words), alignment is needed during fusion. The alignment may force-align the voice data with the text, or let the model learn the correspondence probability between each word and each frame.
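Under the force-align option, for instance, the frames falling inside each word's time span can simply be averaged (a sketch; the spans themselves would come from an external forced aligner, which is assumed here):

    import torch

    def align_frames_to_tokens(frame_feats, token_spans):
        """frame_feats: (num_frames, dim); token_spans: one (start, end) frame
        range per token, produced by a forced aligner. Returns (num_tokens, dim)."""
        return torch.stack([frame_feats[s:e].mean(dim=0) for s, e in token_spans])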
In one example, the text processing module may include a plurality of text feature extraction layers, and the voice processing module includes an acoustic feature extraction module and acoustic feature conversion layers corresponding respectively to the text feature extraction layers. Each acoustic feature conversion layer performs feature conversion on the acoustic features output by the acoustic feature extraction module, and the converted acoustic features serve as input data of the corresponding text feature extraction layer. The fine-tuning of the voice processing module can then adjust its own weights more fully according to text information from shallow to deep layers, further improving recognition accuracy.
As shown in fig. 3, the acoustic feature S output by the voice processing module is multiplied by a different weighting layer C_{i-1} for each layer and then fused with the corresponding text layer. In specific implementation, the following formula can be adopted: O_i = ReLU((O_{i-1} + S * C_{i-1}) * W_i). The formula states that the output data O_i of the i-th text feature extraction layer is obtained by multiplying the layer's input data by the weight layer W_i and applying the activation function, and that the layer's input consists of two parts: the output data O_{i-1} of the previous text feature extraction layer, and the base output S of the voice processing module multiplied by the per-layer weighting layer C_{i-1}. In specific implementation, the fusion of text features and acoustic features may be feature concatenation or feature addition.
In another example, the model may also incorporate audio quality data; correspondingly, the second voice data may include voice data containing noise. In this case, the voice processing module includes an acoustic feature extraction module, an audio quality detection module and an acoustic feature adjustment module. The audio quality detection module acquires the audio quality data of the second voice data; the acoustic feature adjustment module adjusts the acoustic features output by the acoustic feature extraction module according to the audio quality data, and the adjusted acoustic features serve as input data of the corresponding text feature extraction layers.
As shown in fig. 3, the model incorporates an audio quality detection module that produces a speech quality score; the score is multiplied, in a soft-switch-like manner, onto the acoustic features output by the voice processing module, controlling how much audio is retained at different audio qualities. In specific implementation, the following formula can be adopted: O_i = ReLU((O_{i-1} + S * Q * C_{i-1}) * W_i), where Q is the audio quality data.
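Written out as a layer, the formula looks roughly as follows (a sketch; realising C_{i-1} and W_i as bias-free linear maps is an assumption about the exact parameterisation):

    import torch
    import torch.nn as nn

    class GatedFusionLayer(nn.Module):
        """O_i = ReLU((O_{i-1} + S * Q * C_{i-1}) * W_i), with Q in [0, 1]."""
        def __init__(self, d_model):
            super().__init__()
            self.c = nn.Linear(d_model, d_model, bias=False)  # weighting layer C_{i-1}
            self.w = nn.Linear(d_model, d_model, bias=False)  # weight layer W_i

        def forward(self, o_prev, s, q):
            # q: audio quality score of shape (B, 1, 1); q = 0 masks the
            # sound information, q = 1 retains it completely.
            return torch.relu(self.w(o_prev + q * self.c(s)))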
The audio quality detection module may include a time-frequency feature extraction module and an audio quality determination module. The time-frequency feature extraction module extracts time-frequency features, such as the STFT (short-time Fourier transform) data shown in fig. 3, from the second voice data; the audio quality determination module obtains the audio quality data from the time-frequency features and may be implemented based on a neural network.
After the voice data is processed by the audio quality detection module, the output is a sound-quality score Q between 0 and 1, where 1 means good quality and 0 means poor quality. The module is introduced to handle sound in real scenes whose signal-to-noise ratio may be too low, or which may not be speech at all. Training data cannot fully cover such cases; to avoid the risk posed by abnormal data, this soft-switch-like design multiplies the final vector S by Q as the output of the audio module: when the audio quality is poor, Q = 0 is equivalent to masking the sound information; Q = 1 retains the sound information completely; and a value of Q between 0 and 1 retains the sound information to an intermediate degree.
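A sketch of such a quality detector, pooling magnitude-STFT time-frequency features through a small network into a single score Q between 0 and 1 (the network shape and sizes are assumptions):

    import torch
    import torch.nn as nn

    class AudioQualityDetector(nn.Module):
        def __init__(self, n_fft=512, hidden=128):
            super().__init__()
            self.n_fft = n_fft
            self.scorer = nn.Sequential(
                nn.Linear(n_fft // 2 + 1, hidden), nn.ReLU(),
                nn.Linear(hidden, 1), nn.Sigmoid())  # per-frame score in [0, 1]

        def forward(self, waveform):                 # waveform: (num_samples,)
            # Time-frequency feature extraction: magnitude STFT, (frames, bins).
            spec = torch.stft(waveform, self.n_fft,
                              window=torch.hann_window(self.n_fft),
                              return_complex=True).abs().T
            return self.scorer(spec).mean()          # quality score Q of the clip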
To train the audio quality detection module, heavy noise can be added to voice data of good quality, and non-speech audio can be added as training data, so that labeled data is generated for training. With this processing approach, the parallel data available for training covers audio-quality diversity more sufficiently, and high system robustness is maintained after the sound information is introduced.
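Generating that labeled training data might look like this (a sketch; the SNR range and the mapping from SNR to quality label are illustrative assumptions):

    import torch

    def make_quality_example(clean, noise):
        """Mix noise into clean speech at a random SNR and derive a 0-1 quality
        label; assumes the noise clip is at least as long as the speech."""
        snr_db = torch.empty(1).uniform_(-5.0, 20.0).item()   # assumed SNR range
        scale = (clean.pow(2).mean()
                 / (noise.pow(2).mean() * 10 ** (snr_db / 10))).sqrt()
        noisy = clean + scale * noise[: clean.numel()]
        label = torch.sigmoid(torch.tensor((snr_db - 7.5) / 5.0))  # crude mapping
        return noisy, label

    def make_nonspeech_example(nonspeech):
        return nonspeech, torch.tensor(0.0)  # non-speech audio labeled as poor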
As can be seen from the above, the punctuation mark recognition model construction method provided by the embodiments of the application pre-trains the model with a large amount of easily obtained single data (the first text set and the first voice data set) covering more domains, so that the model obtains a good initial effect; it then fixes the network parameter values of the text processing module and fine-tunes the network parameters of the voice processing module with parallel data (the corresponding relation set) covering fewer domains. After exploiting the acoustic feature information, the model can better use the speaker's own intention and obtain punctuation marks that better match spoken language, while its behavior on the main signal source, text, undergoes no substantial change. A model that achieves high punctuation recognition accuracy in more domains can therefore be learned from a small amount of parallel data covering fewer domains, and the model shows a consistent improvement in recognition accuracy across general domains.
Second embodiment
In the foregoing embodiment, a punctuation mark recognition model construction method is provided, and correspondingly, the present application further provides a punctuation mark recognition model construction device. The apparatus corresponds to an embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The application provides a punctuation mark recognition model construction device includes: the device comprises a data acquisition unit, a pre-training unit and an adjusting and optimizing unit.
The data acquisition unit is used for acquiring a first text set and a first voice data set and a corresponding relation set between second voice data and a second text; the pre-training unit is used for learning to obtain network parameters of the text processing module included by the model according to the first text set; according to a first voice data set, learning to obtain a first network parameter of a voice processing module included by the model; and the tuning unit is used for training the voice processing module based on the first network parameter according to the corresponding relation set to obtain a second network parameter of the voice processing module.
In one example, the first text set may include text of a first domain and text of a second domain, the first voice data set may include voice data of the first domain, the corresponding relation set includes text and voice data of the second domain, and the model is used to recognize punctuation marks of voice-transcribed text of the first domain and the second domain.
In one example, the first text set may include text in a first language and text in a second language, the first voice data set may include voice data in the first language, the corresponding relation set includes text and voice data in the second language, and the model is used to recognize punctuation marks of voice-transcribed text in the first language and the second language.
In one example, the first text set may include text in a first domain and first language as well as text in a second domain and second language, the first voice data set may include voice data in the first domain and first language, the corresponding relation set includes text and voice data in the second domain and second language, and the model can be used to recognize punctuation marks of voice-transcribed text in the first domain and first language as well as in the second domain and second language.
In one example, the text processing module includes a plurality of text feature extraction layers, where the input data of each text feature extraction layer comprises the text features output by the previous text feature extraction layer and the acoustic features output by the voice processing module.
In one example, the second speech data includes speech data including noise; the voice processing module comprises: the device comprises an acoustic feature extraction module, an audio quality detection module and an acoustic feature adjustment module; the audio quality detection module is used for acquiring audio quality data of the second voice data; and the acoustic feature adjusting module is used for adjusting the acoustic features output by the acoustic feature extracting module according to the audio quality data, and taking the adjusted acoustic features as the input data of the corresponding text feature extracting layer.
In one example, the speech processing module further comprises: the acoustic feature conversion layers respectively correspond to the text feature extraction layers; and the acoustic feature conversion layer is used for performing feature conversion on the adjusted acoustic features, and taking the converted acoustic features as input data of the corresponding text feature extraction layer.
In one example, the audio quality detection module includes: the time-frequency characteristic extraction module and the audio quality determination module; the time-frequency characteristic extraction module is used for extracting time-frequency characteristics from the second voice data; and the audio quality determining module is used for acquiring the audio quality data according to the time-frequency characteristics.
In one example, the speech processing module comprises: the acoustic feature extraction module is used for respectively corresponding to the acoustic feature conversion layers of the text feature extraction layers; and the acoustic feature conversion layer is used for performing feature conversion on the acoustic features output by the acoustic feature extraction module and taking the converted acoustic features as input data of the corresponding text feature extraction layer.
In one example, the pre-training unit is specifically configured to remove the punctuation marks in the first text; take the first text without punctuation marks as input data of the text processing module and predict the punctuation marks of the input text through the text processing module; and adjust the network parameters of the text processing module according to the difference between the predicted punctuation marks and the punctuation mark annotation information of the first text.
In an example, the pre-training unit is specifically configured to learn, in a self-learning manner according to the first voice data set, the first network parameters of the voice processing module included in the model.
In an example, the tuning unit is specifically configured to use the second text with punctuation removed as input data of the text processing module, use the second voice data corresponding to the second text as input data of the voice processing module, and predict the punctuation of the input text through the model; and to adjust the network parameters of the voice processing module according to the difference between the predicted punctuation marks and the punctuation mark annotation information of the second text.
Third embodiment
In the foregoing embodiment, a punctuation mark recognition model construction method is provided, and correspondingly, the application further provides an electronic device. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; the memory is used for storing a program for realizing the punctuation mark recognition model construction method, and after the equipment is powered on and runs the program of the method through the processor, the following steps are executed: acquiring a first text set and a first voice data set, and acquiring a corresponding relation set between second voice data and a second text; according to the first text set, learning to obtain network parameters of a text processing module included in the model; according to a first voice data set, learning to obtain a first network parameter of a voice processing module included by the model; and training the voice processing module based on the first network parameters according to the corresponding relation set to obtain second network parameters of the voice processing module.
Fourth embodiment
In the above embodiment, a punctuation mark recognition model construction method is provided, and correspondingly, the present application further provides a voice transcription system. Since the system embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The system embodiments described below are merely illustrative.
The present application additionally provides a voice transcription system, comprising: a server and a conference terminal. The conference terminal includes but is not limited to: a sound pickup, a speakerphone, a video conference terminal, etc.
The conference terminal is used for acquiring the voice data and displaying the voice transcription text with the punctuations; the server is used for receiving conference voice data sent by the conference terminal; acquiring a voice transcription text of the voice data; recognizing punctuation information of the voice transcription text according to the voice data and the voice transcription text through a punctuation recognition model; and returning the voice transcription text with the punctuation marks to the conference terminal, and displaying the text through the conference terminal, such as on a large screen of a conference site.
Wherein the model is constructed in the following way: acquiring a first text set and a first voice data set, and acquiring a corresponding relation set between second voice data and a second text; according to the first text set, learning to obtain network parameters of a text processing module included in the model; according to a first voice data set, learning to obtain a first network parameter of a voice processing module included by the model; and training the voice processing module based on the first network parameters according to the corresponding relation set to obtain second network parameters of the voice processing module.
As can be seen from the foregoing embodiments, the voice transcription system provided in the embodiments of the present application performs punctuation mark recognition on the transcription text of conference voice through the punctuation mark recognition model constructed in the first embodiment, so that high punctuation recognition accuracy is achieved even when the conference voice and the parallel data of the model training stage belong to different domains.
Fifth embodiment
In the above embodiment, a punctuation mark recognition model construction method is provided, and correspondingly, the application further provides a voice interaction system. Since the system embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The system embodiments described below are merely illustrative.
The present application additionally provides a voice interaction system, comprising: a server and a client. The client includes but is not limited to: terminal equipment such as personal computer, panel computer, smart mobile phone, smart sound box.
The server is used for receiving a voice interaction request aiming at target voice data sent by the client; acquiring a voice transcription text of the voice data; recognizing punctuation information of the voice transcription text according to the voice data and the voice transcription text through a punctuation recognition model; determining voice reply information and/or voice instruction information according to the voice transcription text with punctuations; returning the voice reply information and/or the voice instruction information to the client; the client is used for acquiring the target voice data; and displaying the voice reply information and/or executing the voice instruction information.
Wherein the model is constructed in the following way: acquiring a first text set and a first voice data set, and acquiring a corresponding relation set between second voice data and a second text; according to the first text set, learning to obtain network parameters of a text processing module included in the model; according to a first voice data set, learning to obtain a first network parameter of a voice processing module included by the model; and training the voice processing module based on the first network parameters according to the corresponding relation set to obtain second network parameters of the voice processing module.
The voice reply information can be text reply information, speech reply information, or reply information in other forms.
In one example, the client is a smart speaker that collects user voice data, such as "Tmall Genie, turn the air-conditioner temperature up". From this, the system can determine that the voice instruction information is "air conditioner: temperature above 25 degrees", and the smart speaker can execute the instruction and set the air conditioner above 25 degrees.
As can be seen from the foregoing embodiments, in the voice interaction system provided in the embodiment of the present application, punctuation recognition is performed on a transcribed text of an interactive voice through the punctuation recognition model constructed in the first embodiment, so that even if the interactive voice and parallel data in a model training phase belong to different fields, a higher punctuation recognition accuracy can be achieved, thereby obtaining a better voice interaction effect.
Although the present application has been described with reference to preferred embodiments, they are not intended to limit it. Those skilled in the art can make possible variations and modifications without departing from the spirit and scope of the present application; the protection scope of the present application should therefore be determined by the appended claims.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (14)

1. A punctuation mark recognition model construction method is characterized by comprising the following steps:
acquiring a first text set and a first voice data set, and acquiring a corresponding relation set between second voice data and a second text;
learning, according to the first text set, network parameters of a text processing module included in the model; learning, according to the first voice data set, first network parameters of a voice processing module included in the model; the input data of the voice processing module being voice data, from which high-order acoustic features are extracted and output through multiple feature extraction layers; the input data of the text processing module being text without punctuation marks, from which text features are extracted through multiple feature extraction layers; the text processing module predicting the punctuation marks included in the text based on the text features and the high-order acoustic features;
and training the voice processing module based on the first network parameter according to the corresponding relation set to obtain a second network parameter of the voice processing module, wherein the constructed model comprises the voice processing module based on the second network parameter and the text processing module, and the network parameter of the text processing module is a parameter obtained by learning according to the first text set.
2. The method of claim 1, wherein the first text set and the first voice data set comprise text and voice information in a first domain and/or a first language, the correspondence set comprises text and voice information in a second domain and/or a second language, and the model is configured to recognize punctuation marks in text transcribed from voice in the first domain and/or the first language.
3. The method of claim 1,
the text processing module comprises a plurality of text feature extraction layers;
the input data of each text feature extraction layer comprises: the text features output by the preceding text feature extraction layer and the acoustic features output by the voice processing module.
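Concretely, under the sketch given after claim 1, each text feature extraction layer receives the preceding layer's text features together with the speech module's acoustic output; a hypothetical single-layer view:

```python
import torch
import torch.nn as nn

class TextFeatureExtractionLayer(nn.Module):
    """One layer in the sense of claim 3: inputs are the text features from
    the preceding layer plus the acoustic features from the voice module."""
    def __init__(self, dim=256):
        super().__init__()
        self.self_block = nn.TransformerEncoderLayer(dim, nhead=4,
                                                     batch_first=True)
        self.cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, prev_text_features, acoustic_features):
        h = self.self_block(prev_text_features)
        attended, _ = self.cross(h, acoustic_features, acoustic_features)
        return h + attended                 # text features for the next layer

layer = TextFeatureExtractionLayer()
text_feats = torch.randn(2, 20, 256)        # (batch, text length, dim)
acoustic_feats = torch.randn(2, 120, 256)   # (batch, audio frames, dim)
out = layer(text_feats, acoustic_feats)     # (2, 20, 256)
```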
4. The method of claim 3,
the second voice data includes voice data containing noise;
the voice processing module comprises: an acoustic feature extraction module, an audio quality detection module, and an acoustic feature adjustment module;
the audio quality detection module is used for acquiring audio quality data of the second voice data;
and the acoustic feature adjustment module is used for adjusting the acoustic features output by the acoustic feature extraction module according to the audio quality data, and taking the adjusted acoustic features as input data of the corresponding text feature extraction layer.
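By way of illustration only, one reading of these two modules follows; the quality-to-weight mapping below is an invented example, not the claimed mechanism.

```python
import torch
import torch.nn as nn

class AudioQualityDetectionModule(nn.Module):
    """Maps second voice data (here, filterbank frames) to a per-utterance
    quality score in (0, 1); noisy audio should score low."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(n_mels, 64), nn.ReLU(),
                                    nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, frames):                   # (B, T, n_mels)
        return self.scorer(frames).mean(dim=1)   # (B, 1) quality score

class AcousticFeatureAdjustmentModule(nn.Module):
    """Scales acoustic features by audio quality so that unreliable
    (noisy) acoustic evidence contributes less downstream."""
    def forward(self, acoustic_features, quality):        # (B, T, D), (B, 1)
        return acoustic_features * quality.unsqueeze(1)   # broadcast over T
```

The scalar gate is only one possible adjustment; any quality-conditioned transform fits the claim wording.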
5. The method of claim 4,
the voice processing module further comprises: acoustic feature conversion layers respectively corresponding to the text feature extraction layers;
and the acoustic feature conversion layer is used for performing feature conversion on the adjusted acoustic features, and taking the converted acoustic features as input data of the corresponding text feature extraction layer.
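Under the same assumptions, the per-layer conversion of claim 5 can be as small as one learned projection per text feature extraction layer:

```python
import torch.nn as nn

n_text_layers, dim = 4, 256   # assumed to match the sketch after claim 1

# One conversion layer per text feature extraction layer; each converts the
# adjusted acoustic features into the representation its own layer consumes.
acoustic_conversion_layers = nn.ModuleList(
    nn.Linear(dim, dim) for _ in range(n_text_layers))
```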
6. The method of claim 4,
the audio quality detection module comprises: a time-frequency feature extraction module and an audio quality determination module;
the time-frequency feature extraction module is used for extracting time-frequency features from the second voice data;
and the audio quality determination module is used for acquiring the audio quality data according to the time-frequency features.
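A hedged sketch of claim 6: time-frequency features taken from an STFT of the raw waveform, with quality estimated from them; the spectral-flatness proxy is purely illustrative (flat spectra suggest noise-like audio, hence lower quality).

```python
import torch

def extract_time_frequency_features(waveform, n_fft=400, hop=160):
    """Time-frequency feature extraction module: STFT magnitudes."""
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft),
                      return_complex=True)
    return spec.abs()                            # (freq bins, frames)

def determine_audio_quality(tf_features, eps=1e-8):
    """Audio quality determination module: a crude spectral-flatness
    proxy mapped to a score in [0, 1]."""
    power = tf_features.pow(2) + eps
    flatness = power.log().mean().exp() / power.mean()
    return float(1.0 - flatness.clamp(0.0, 1.0))

wave = torch.randn(16000)                        # one second at 16 kHz (synthetic)
tf = extract_time_frequency_features(wave)
quality = determine_audio_quality(tf)            # scalar in [0, 1]
```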
7. The method of claim 3,
the voice processing module comprises: an acoustic feature extraction module, and acoustic feature conversion layers respectively corresponding to the text feature extraction layers;
and the acoustic feature conversion layer is used for performing feature conversion on the acoustic features output by the acoustic feature extraction module and taking the converted acoustic features as input data of the corresponding text feature extraction layer.
8. The method of claim 1, wherein learning the network parameters of the text processing module included in the model from the first text set comprises:
removing the punctuation marks from the first text;
taking the first text without punctuation marks as input data of the text processing module, and predicting the punctuation marks of the input text through the text processing module;
and adjusting the network parameters of the text processing module according to the difference between the predicted punctuation marks and the punctuation mark annotation information of the first text.
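An illustrative realization of this training signal, with a toy label scheme (predict, for each word, the mark that follows it); the tokenizer, label set, and regular expression are assumptions of the example, not claimed elements.

```python
import re
import torch
import torch.nn.functional as F

PUNCTS = {"": 0, ",": 1, ".": 2, "?": 3, "!": 4}   # assumed label set

def strip_and_label(first_text):
    """Remove punctuation marks from the first text, keeping per-word
    labels that record which mark followed each word."""
    words, labels = [], []
    for token in first_text.split():
        m = re.match(r"^(\w+)([,.?!]?)$", token)
        if m:
            words.append(m.group(1))
            labels.append(PUNCTS[m.group(2)])
    return words, labels

words, labels = strip_and_label("well, it works. really?")
# words  -> ['well', 'it', 'works', 'really']
# labels -> [1, 0, 2, 3]

# One parameter update: predict punctuation from the stripped text and move
# the text module against the difference to the annotation (cross-entropy).
# `text_module` is the hypothetical module sketched after claim 1, and
# `tokenize` / `optimizer` are assumed helpers:
# logits = text_module(tokenize(words))                  # (1, T, n_puncts)
# loss = F.cross_entropy(logits.squeeze(0), torch.tensor(labels))
# loss.backward(); optimizer.step()
```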
9. The method of claim 1, wherein learning the first network parameters of the voice processing module included in the model from the first voice data set comprises:
and learning to obtain, in a self-learning mode, the first network parameters of the voice processing module included in the model according to the first voice data set.
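The claim leaves the self-learning mode open; masked-frame reconstruction is one common self-supervised objective and is shown here only as an assumed instance:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def self_learning_step(speech_module, frames, mask_ratio=0.15):
    """One illustrative self-supervised step: hide random frames and train
    the module to reconstruct them from context; no transcripts are used."""
    B, T, D = frames.shape
    mask = torch.rand(B, T, 1) < mask_ratio            # frames to hide
    hidden = speech_module(frames.masked_fill(mask, 0.0))
    # Reconstruction head on top of the high-order features; created inline
    # only to keep the sketch short, a real setup would keep it persistent
    # during pre-training and discard it afterwards.
    recon = nn.Linear(hidden.size(-1), D)(hidden)
    wide = mask.expand_as(frames)
    return F.mse_loss(recon[wide], frames[wide])
```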
10. The method of claim 1, wherein training the voice processing module based on the first network parameters according to the correspondence set to obtain the second network parameters of the voice processing module comprises:
taking the second text without punctuation marks as input data of the text processing module, taking the second voice data corresponding to the second text as input data of the voice processing module, and predicting the punctuation marks of the input text through the model;
and adjusting the network parameters of the voice processing module according to the difference between the predicted punctuation marks and the punctuation mark annotation information of the second text.
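Stage three under the earlier sketch's assumptions: the text processing module keeps its first-text-set parameters while only the voice processing module is updated, yielding the second network parameters.

```python
import torch
import torch.nn.functional as F

def finetune_speech_module(speech_module, text_module, pairs, epochs=1):
    """`pairs` yields (voice frames, punctuation-free token ids, labels)
    drawn from the correspondence set between second voice data and texts."""
    for p in text_module.parameters():
        p.requires_grad = False                      # text module stays fixed
    opt = torch.optim.Adam(speech_module.parameters(), lr=1e-4)
    for _ in range(epochs):
        for frames, tokens, labels in pairs:
            acoustic = speech_module(frames)         # high-order features
            logits = text_module(tokens, acoustic)   # (B, T, n_puncts)
            loss = F.cross_entropy(logits.transpose(1, 2), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()                               # second network parameters
    return speech_module
```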
11. A punctuation mark recognition model construction device, comprising:
the data acquisition unit is used for acquiring a first text set and a first voice data set, and a set of correspondences between second voice data and second texts;
the pre-training unit is used for learning, according to the first text set, network parameters of a text processing module included in the model, and learning, according to the first voice data set, first network parameters of a voice processing module included in the model; wherein the input data of the voice processing module is voice data, from which high-order acoustic features are extracted and output through multi-layer feature extraction; the input data of the text processing module is text without punctuation marks, from which text features are extracted through multi-layer feature extraction; and the text processing module predicts the punctuation marks included in the text based on the text features and the high-order acoustic features;
and the tuning unit is used for training the voice processing module, initialized with the first network parameters, according to the correspondence set to obtain second network parameters of the voice processing module, the constructed model comprising the voice processing module based on the second network parameters and the text processing module, the network parameters of the text processing module being those learned from the first text set.
12. An electronic device, comprising:
a voice acquisition device;
a speaker;
a processor; and
a memory for storing a program implementing the punctuation mark recognition model construction method according to any one of claims 1 to 10, wherein, after the device is powered on, the program for the method is run by the processor.
13. A voice transcription system, comprising:
the server is used for receiving conference voice data sent by the conference terminal; acquiring a voice transcription text of the voice data; recognizing, through a punctuation mark recognition model, punctuation mark information of the voice transcription text according to the voice data and the voice transcription text; and returning the voice transcription text with punctuation marks to the conference terminal;
the conference terminal is used for collecting the voice data and displaying the voice transcription text with the punctuation marks;
wherein the model is constructed in the following way: acquiring a first text set and a first voice data set, and acquiring a set of correspondences between second voice data and second texts; learning, according to the first text set, network parameters of a text processing module included in the model; learning, according to the first voice data set, first network parameters of a voice processing module included in the model; wherein the input data of the voice processing module is voice data, from which high-order acoustic features are extracted and output through multi-layer feature extraction, and the input data of the text processing module is text without punctuation marks, from which text features are extracted through multi-layer feature extraction, the text processing module predicting the punctuation marks included in the text based on the text features and the high-order acoustic features; and training the voice processing module, initialized with the first network parameters, according to the correspondence set to obtain second network parameters of the voice processing module, the constructed model comprising the voice processing module based on the second network parameters and the text processing module, the network parameters of the text processing module being those learned from the first text set.
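On the serving side, the constructed model would be applied to the ASR output together with the original audio; the following schematic of the per-request flow is hypothetical in every name (asr, tokenize, render are stand-ins, not claimed components):

```python
def punctuate_transcript(model, asr, conference_audio):
    """Hypothetical per-request server flow: transcribe, then let the
    constructed model add punctuation using both audio and text."""
    transcript = asr.transcribe(conference_audio)     # unpunctuated text
    frames = asr.features(conference_audio)           # acoustic frames
    acoustic = model.speech_module(frames)
    logits = model.text_module(model.tokenize(transcript), acoustic)
    marks = logits.argmax(dim=-1)                     # one label per token
    return model.render(transcript, marks)            # reinsert the marks
```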
14. A voice interaction system, comprising:
the server is used for receiving a voice interaction request, sent by the client, for target voice data; acquiring a voice transcription text of the voice data; recognizing, through a punctuation mark recognition model, punctuation mark information of the voice transcription text according to the voice data and the voice transcription text; determining voice reply information and/or voice instruction information according to the voice transcription text with punctuation marks; and returning the voice reply information and/or the voice instruction information to the client;
the client is used for collecting the target voice data, and displaying the voice reply information and/or executing the voice instruction information;
wherein the model is constructed in the following way: acquiring a first text set and a first voice data set, and acquiring a set of correspondences between second voice data and second texts; learning, according to the first text set, network parameters of a text processing module included in the model; learning, according to the first voice data set, first network parameters of a voice processing module included in the model; wherein the input data of the voice processing module is voice data, from which high-order acoustic features are extracted and output through multi-layer feature extraction, and the input data of the text processing module is text without punctuation marks, from which text features are extracted through multi-layer feature extraction, the text processing module predicting the punctuation marks included in the text based on the text features and the high-order acoustic features; and training the voice processing module, initialized with the first network parameters, according to the correspondence set to obtain second network parameters of the voice processing module, the constructed model comprising the voice processing module based on the second network parameters and the text processing module, the network parameters of the text processing module being those learned from the first text set.
CN202210030614.9A 2022-01-12 2022-01-12 Punctuation mark recognition model construction method and punctuation mark recognition model construction device Active CN114049885B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210030614.9A CN114049885B (en) 2022-01-12 2022-01-12 Punctuation mark recognition model construction method and punctuation mark recognition model construction device

Publications (2)

Publication Number Publication Date
CN114049885A (en) 2022-02-15
CN114049885B (en) 2022-04-22

Family

ID=80196308

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210030614.9A Active CN114049885B (en) 2022-01-12 2022-01-12 Punctuation mark recognition model construction method and punctuation mark recognition model construction device

Country Status (1)

Country Link
CN (1) CN114049885B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160854A (en) * 2020-01-22 2021-07-23 阿里巴巴集团控股有限公司 Voice interaction system, related method, device and equipment
CN111696535B (en) * 2020-05-22 2021-10-26 百度在线网络技术(北京)有限公司 Information verification method, device, equipment and computer storage medium based on voice interaction
CN113362811B (en) * 2021-06-30 2023-03-24 北京有竹居网络技术有限公司 Training method of voice recognition model, voice recognition method and device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1422692A3 (en) * 2002-11-22 2004-07-14 ScanSoft, Inc. Automatic insertion of non-verbalized punctuation in speech recognition
US9135231B1 (en) * 2012-10-04 2015-09-15 Google Inc. Training punctuation models
CN110032732A (en) * 2019-03-12 2019-07-19 平安科技(深圳)有限公司 A kind of text punctuate prediction technique, device, computer equipment and storage medium
CN112580326A (en) * 2019-09-27 2021-03-30 上海智臻智能网络科技股份有限公司 Punctuation mark model and training system thereof
US20210365632A1 (en) * 2020-05-19 2021-11-25 International Business Machines Corporation Text autocomplete using punctuation marks
CN112216284A (en) * 2020-10-09 2021-01-12 携程计算机技术(上海)有限公司 Training data updating method and system, voice recognition method and system, and equipment
CN112906348A (en) * 2021-02-04 2021-06-04 云从科技集团股份有限公司 Method, system, device and medium for automatically adding punctuation marks to text
CN113012701A (en) * 2021-03-16 2021-06-22 联想(北京)有限公司 Identification method, identification device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chinese word segmentation and punctuation prediction based on improved multi-layer BLSTM; Li Yakun et al.; Journal of Computer Applications; 2018-05-10; Vol. 38, No. 05; pp. 1278-1282 *

Also Published As

Publication number Publication date
CN114049885A (en) 2022-02-15

Similar Documents

Publication Publication Date Title
EP3994683B1 (en) Multilingual neural text-to-speech synthesis
CN106683677B (en) Voice recognition method and device
CN107481720A Explicit voiceprint recognition method and device
CN111785275A (en) Voice recognition method and device
CN109616096A (en) Construction method, device, server and the medium of multilingual tone decoding figure
CN103886863A (en) Audio processing device and audio processing method
CN112951258B (en) Audio/video voice enhancement processing method and device
US11893813B2 (en) Electronic device and control method therefor
CN110970036A (en) Voiceprint recognition method and device, computer storage medium and electronic equipment
CN111710326A (en) English voice synthesis method and system, electronic equipment and storage medium
CN109714608A (en) Video data handling procedure, device, computer equipment and storage medium
CN113450774A (en) Training data acquisition method and device
CN114783424A (en) Text corpus screening method, device, equipment and storage medium
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium
CN113393841B (en) Training method, device, equipment and storage medium of voice recognition model
CN111400463A (en) Dialog response method, apparatus, device and medium
CN118072734A (en) Speech recognition method, device, processor, memory and electronic equipment
CN117975942A (en) Training method of voice recognition model, voice recognition method and related device
CN112837674B (en) Voice recognition method, device, related system and equipment
CN114049885B (en) Punctuation mark recognition model construction method and punctuation mark recognition model construction device
CN112837688B (en) Voice transcription method, device, related system and equipment
CN110634486A (en) Voice processing method and device
CN115691503A (en) Voice recognition method and device, electronic equipment and storage medium
CN113053409B (en) Audio evaluation method and device
CN113823271B (en) Training method and device for voice classification model, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant