CN115132174A - Voice data processing method and device, computer equipment and storage medium

Voice data processing method and device, computer equipment and storage medium

Info

Publication number
CN115132174A
CN115132174A
Authority
CN
China
Prior art keywords
text
target
pronunciation
information
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210705099.XA
Other languages
Chinese (zh)
Inventor
张欢韵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huace Huihong Technology Co ltd
Original Assignee
Shenzhen Huace Huihong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huace Huihong Technology Co ltd filed Critical Shenzhen Huace Huihong Technology Co ltd
Priority to CN202210705099.XA
Publication of CN115132174A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/01 Assessment or evaluation of speech recognition systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Abstract

An embodiment of the present application discloses a voice data processing method and apparatus, a computer device, and a storage medium. The voice data processing method comprises the following steps: performing voice recognition processing on target voice data to obtain a voice recognition result of the target voice data, wherein the voice recognition result comprises text information and pronunciation information; carrying out posterior error correction processing on the text information to obtain a reference text corresponding to the target voice data; and determining the pronunciation standard degree of the target voice data according to the reference pronunciation information of the reference text and the pronunciation information of the target voice data. With this embodiment, the reference text does not need to be provided in advance but can be obtained through posterior error correction processing, so that the pronunciation standard degree of any voice data can be accurately identified, improving universality across scenes.

Description

Voice data processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a voice data processing method, a voice data processing apparatus, a computer device, and a computer-readable storage medium.
Background
With the development of artificial intelligence technology, speech recognition has made remarkable progress and is now widely applied in various voice interaction scenes, such as spoken language examinations, Mandarin assessment, and man-machine communication, bringing great convenience to people.
In some scenarios where the pronunciation standard degree of a speaker needs to be determined, a specific test text is given, and the speaker's pronunciation level and grade are determined by processing the speaker's reading of that test text. However, because this method is limited by the test text, the pronunciation level of an arbitrary piece of the speaker's speech may not be determinable, and it is difficult to obtain the speaker's pronunciation standard degree in a natural state; a new processing mechanism is therefore needed to solve this problem.
Disclosure of Invention
The embodiment of the present application provides a voice data processing method and apparatus, a computer device, and a storage medium. The reference text does not need to be provided in advance but can be obtained through posterior error correction processing, so that the pronunciation standard degree of any voice data can be accurately identified, improving universality across scenes.
In one aspect, an embodiment of the present application provides a method for processing voice data, including:
performing voice recognition processing on the target voice data to obtain a voice recognition result of the target voice data, wherein the voice recognition result comprises text information and pronunciation information;
carrying out posterior error correction processing on the text information to obtain a reference text corresponding to the target voice data;
and determining the pronunciation standard degree of the target voice data according to the reference pronunciation information of the reference text and the pronunciation information of the target voice data.
In one aspect, an embodiment of the present application provides a voice data processing apparatus, including:
the processing module is used for carrying out voice recognition processing on the target voice data to obtain a voice recognition result of the target voice data, wherein the voice recognition result comprises text information and pronunciation information;
the processing module is used for carrying out posterior error correction processing on the text information to obtain a reference text corresponding to the target voice data;
and the determining module is used for determining the pronunciation standard degree of the target voice data according to the reference pronunciation information of the reference text and the pronunciation information of the target voice data.
Accordingly, an embodiment of the present application provides a computer device, including: a processor, a memory, and a network interface; the processor is connected with the memory and the network interface, wherein the network interface is used for providing a network communication function, the memory is used for storing program codes, and the processor is used for calling the program codes to execute the voice data processing method in the embodiment of the application.
Accordingly, embodiments of the present application provide a computer-readable storage medium, in which a computer program is stored, where the computer program includes program instructions, and when the program instructions are executed by a processor, the method for processing voice data in embodiments of the present application is performed.
Accordingly, embodiments of the present application provide a computer program product, which includes a computer program or computer instructions, and when the computer program or the computer instructions are executed by a processor, the method for processing voice data of the embodiments of the present application is implemented.
In the embodiment of the present application, speech recognition processing is supported on any speech data, yielding a speech recognition result containing text information and pronunciation information. A reference text can be determined based on posterior error correction processing of the text information; this reference text serves as reference data for evaluating the pronunciation standard of the target speech data, and the pronunciation standard degree of the target speech data can be determined from the reference pronunciation information of the reference text and the recognized pronunciation information. In this process, the reference text does not need to be given in advance; instead, the posterior error correction processing imitates the way the human brain, after acquiring a text, automatically corrects it into the right text, thereby obtaining the reference text corresponding to the target voice data. Because any voice data can obtain a corresponding reference text during processing, recognition of the pronunciation standard degree of a piece of speech is no longer constrained by the content of a pre-given reference text. The method and the device are therefore suitable for accurately evaluating the pronunciation standard degree of a speaker's speech in any natural state, improving universality across various scenes.
Drawings
FIG. 1 is an architecture diagram of a voice data processing system according to an embodiment of the present application;
fig. 2 is a first flowchart illustrating a voice data processing method according to an embodiment of the present application;
fig. 3 is a second flowchart illustrating a voice data processing method according to an embodiment of the present application;
fig. 4 is a third schematic flowchart of a voice data processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating an effect of outputting a prompt according to an embodiment of the present application;
fig. 6 is a flowchart for evaluating the pronunciation criterion and outputting a prompt according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a speech data processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to better understand the scheme of the embodiments of the present application, the following first introduces the related terms and concepts that may be involved in the embodiments of the present application.
Automatic speech recognition technology: Automatic Speech Recognition, abbreviated ASR. A technique by which a machine automatically converts speech content into text.
CTC: connection terminal Temporal Classification, guidelines, and variants thereof. An algorithm commonly used in the fields of speech recognition, text recognition and the like is used for solving the problems of inconsistent input and output sequence lengths and incapability of aligning.
MFCC: Mel Frequency Cepstral Coefficient. A nonlinear frequency scale based on the human ear's perception of equidistant pitch changes, commonly used as a speech feature in speech recognition.
Levenshtein distance: also known as the levenstein distance, is an edit distance. Refers to the minimum number of editing operations required between two character strings to change from one character string to another. The allowed editing operations include replacing one character with another, inserting one character, and deleting one character.
Phoneme: the smallest unit of speech, divided according to the natural attributes of speech. Phonemes can be described in terms of pronunciation actions; for example, "ma" contains two pronunciation actions, m and a, and therefore two phonemes. A pronunciation action is the articulatory movement required to produce a sound, such as closing the upper and lower lips, vibrating the vocal cords, or letting airflow pass out through the nasal cavity.
Based on the above terms and concepts, the architecture of the speech data processing system provided by the embodiments of the present application will be described with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is an architecture diagram of a voice data processing system according to an embodiment of the present application. As shown in fig. 1, the voice data processing system includes a terminal device 100 and a server 101, and the terminal device may establish a communication connection with the server 101 in a wired or wireless manner.
The terminal device 100 is used to collect target voice data, which may be the voice data of any sentence or paragraph spoken by a speaker in any state. The terminal device 100 may transmit the collected target voice data to the server 101, and the server 101 processes the target voice data. In one implementation, the terminal device 100 may receive the data processing results (e.g., the voice recognition result and the pronunciation standard degree) returned by the server 101 and output them to present the evaluation of the speaker's pronunciation. For example, the terminal device 100 outputs the recognized text and pinyin together with the pronunciation standard degree of the target voice data, giving the speaker a rating and prompt information to view. It should be noted that the terminal devices include, but are not limited to: mobile phones, computers, intelligent voice interaction devices, smart household appliances, vehicle-mounted terminals, aircraft, and other devices; the present application does not limit the type or number of terminal devices.
The server 101 is used to process the target voice data. This processing includes, but is not limited to: performing speech recognition processing on the target voice data, performing posterior error correction processing on the text information obtained by speech recognition, calculating the pronunciation standard degree, and so on. The text information obtained by speech recognition can be text in any one or more languages, such as text containing Chinese, text containing English, or text containing both. The pronunciation information obtained by speech recognition is the original pronunciation data of the target voice data and corresponds to the text information: for a Chinese text, the pronunciation information may be the pinyin or phoneme information of each character; for an English text, it may be the phonetic symbol information of each word.
In one embodiment, the speech recognition processing and the posterior error correction processing may be implemented with corresponding speech processing models, such as a text recognition model that processes the target speech data to obtain text information and a pinyin recognition model that processes the target speech data to obtain pinyin information. The posterior error correction processing can judge the speech recognition effect at the semantic level using a semantic recognition model, and different processing modes can be selected according to the speech recognition effect so as to obtain the reference text that serves as the evaluation reference. For example, if the speech recognition effect is unsatisfactory, the text information obtained by speech recognition may be further processed using a language representation model, which may be a pre-trained BERT (Bidirectional Encoder Representation from Transformers) model, before determining whether the text information can be used as the reference text.
The reference pronunciation information of the reference text is standard pronunciation information of the reference text, and in one embodiment, the reference pronunciation information of the reference text can be used as a calculation basis to determine the similarity between the reference pronunciation information and the pronunciation information, so as to obtain the pronunciation standard degree of the target voice data. Further, the server 101 may transmit the pronunciation criterion to the terminal device 100 to display the pronunciation criterion in the terminal device 100. In addition, text information (e.g. characters) and pronunciation information (e.g. pinyin) obtained by voice recognition can also be sent to the terminal device 100 for display.
It should be noted that the server 101 may be an independent physical server, may also be a server cluster or distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform, but is not limited thereto. The number of servers is not limited in this application.
The voice data processing system provided by the embodiment of the present application can be applied to scenes such as interviews, customer-service quality inspection, Putonghua (Mandarin) assessment, and spoken language examination, to judge the pronunciation standard degree of a user's speech in any state. The voice data processing system supports voice recognition processing on any voice data to obtain a voice recognition result containing text information and pronunciation information. A reference text can be determined based on posterior error correction processing of the text information and used as reference data for evaluating the pronunciation standard of the target voice data; specifically, the pronunciation standard degree of the target voice data is determined from the reference pronunciation information of the reference text and the recognized pronunciation information. Because the posterior error correction processing imitates the way the human brain automatically corrects an acquired text into the right text, the reference text does not need to be preset, and any voice data can obtain a corresponding reference text. The pronunciation standard degree of a piece of speech is thus no longer constrained by preset reference-text content, so the method is suitable for accurately evaluating the pronunciation standard degree of a speaker's speech in any natural state, improving universality across various scenes.
Referring to fig. 2, fig. 2 is a first flowchart illustrating a voice data processing method according to an embodiment of the present application. The voice data processing method may be performed by a computer device (e.g., the server 101 in fig. 1). The voice data processing method may include the following.
S201, carrying out voice recognition processing on the target voice data to obtain a voice recognition result of the target voice data.
The target voice data is the voice data to be processed. It may be speech from a speaker collected in real time by a related device (for example, a recording device carried by the terminal device, such as a microphone) and sent to the server, or any piece of voice data acquired from a voice database containing massive voice data uploaded offline. The target speech data may be speech information in any specified language (e.g., Chinese, English); the language contained in the target speech data is not limited here.
Speech recognition is performed on the target voice data, and the obtained speech recognition result includes text information and pronunciation information. The text information is data describing the text of the target speech data, such as the text content of each sentence. The pronunciation information is data describing the pronunciation of the speech; it is the user's original pronunciation and does not necessarily coincide with the standard pronunciation of the text. The pronunciation information may be pinyin information including tones and phonemes, or it may be phoneme information. The pronunciation information and the text information may correspond; for example, if the text information is a Chinese text of 50 characters, the pronunciation information may be the pinyin of each character. For example, the text result (i.e., text information) obtained by performing speech recognition on the target speech data is: "The peanuts are really good, with a special taste", and the pinyin result (i.e., pronunciation information) obtained by speech recognition is: "hua2 sheng1 zheng1 ha3, you3 yi4 zhong3 te4 bie2 de5 wei4 dao4", where the digits 1, 2, etc. indicate the pinyin tones; specifically, 1 indicates the first tone, 2 the second tone, 3 the third tone, 4 the fourth tone, and 5 the neutral tone.
In one implementation, the speech recognition processing on the target speech data may include text recognition processing, from which the text information is obtained, and pronunciation recognition processing, from which the pronunciation information is obtained. This may proceed as follows: call a speech recognition model to perform speech recognition processing on the target speech data to obtain the speech recognition result of the target speech data. Optionally, the speech recognition model includes a text recognition model and a pronunciation recognition model. The text recognition model is configured to perform text recognition processing on the target speech data to obtain the text information of the target speech data; it may adopt a CTC model or an RNN Transducer model (a variant of the CTC model), and the text information in the output speech recognition result may include text after model error correction. The pronunciation recognition model is used to perform pronunciation recognition processing on the target speech data to obtain the pronunciation information of the target speech data, where the pronunciation information is the recognized original pronunciation of the user. Since different languages are labeled differently (for example, Chinese is labeled with pinyin and English with phonetic symbols), the pronunciation information corresponds to the language contained in the target voice data: when the target speech data contains Chinese speech information, the pronunciation information may include pinyin or phonemes; when it contains English speech information, the pronunciation information may include phonetic symbols or phonemes. Taking pinyin information as the obtained pronunciation information, the processing of the target voice data by the pronunciation recognition model may include: extracting features from the target voice data using a speech signal feature extraction method such as MFCC, combining adjacent frames into phonemes through an acoustic model, and combining the phonemes into pinyin. In summary, the speech recognition result output by the speech recognition model includes two parts: text information and pronunciation information. For Chinese speech information, for example, the speech recognition result obtained by the speech recognition model may include the text and pinyin of each sentence. With the support of the speech recognition model, speech recognition is performed automatically on the target voice data to obtain the speech recognition result.
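A minimal sketch of this two-branch recognition step follows, assuming librosa for MFCC extraction; the text_model and pinyin_model objects and their decode method are hypothetical stand-ins for the CTC/acoustic models described above:

    import librosa  # assumed dependency for MFCC feature extraction

    def recognize(wav_path: str, text_model, pinyin_model) -> dict:
        """Run both recognition branches on one utterance and return the
        combined speech recognition result (text plus original pinyin)."""
        y, sr = librosa.load(wav_path, sr=16000)               # mono waveform
        feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # frames x 13 MFCCs
        return {
            "text": text_model.decode(feats),      # e.g. a CTC / RNN Transducer model
            "pinyin": pinyin_model.decode(feats),  # user's original pinyin, e.g. ["hua2", "sheng1", ...]
        }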
S202, carrying out posterior error correction processing on the text information to obtain a reference text corresponding to the target voice data.
For the text information included in the speech recognition result, the reference text corresponding to the target voice data can be obtained through posterior error correction processing. The posterior error correction processing simulates the error correction mechanism of the human brain: it can correct inaccurate text information and acquire the correct text corresponding to the pronunciation information. It is a post-hoc processing step whose processing granularity is obtained by sentence-breaking the text information; a unit may be a single sentence of text (possibly containing several commas or other symbols), multiple sentences of text, or text separated by punctuation marks. The reference text obtained through the posterior error correction processing is used as a scoring reference and can be used to judge the pronunciation standard degree of the target speech data. In this embodiment, the reference text may also be referred to as the standard text.
In one embodiment, the posterior error correction stage may first determine the speech recognition effect of the text information and then apply different processing strategies according to that effect. The speech recognition effect can be determined by whether the semantics of a sentence are complete; semantics here means the meaning of the content, covering both language meaning and sentence structure. Semantically complete and semantically incomplete text are processed with different strategies to obtain the reference text, as detailed in the description of the embodiments below and not elaborated here.
S203, determining the pronunciation standard degree of the target voice data according to the reference pronunciation information of the reference text and the pronunciation information of the target voice data.
The reference pronunciation information of the reference text is the standard pronunciation corresponding to the text and can serve as the reference for the pronunciation information of the target voice data. For example, if the reference text is a Chinese text, the reference pronunciation information may be the text's pinyin, or one or more of the standard phonemes and tones of the text's pinyin. In one implementation, the reference pronunciation information may be obtained from a dictionary and/or a speech library. The dictionary contains the pronunciations of all characters or words; the speech library contains all characters (or words) with their corresponding pronunciations, together with standard acoustic models of the basic phonemes of Chinese pinyin. The phonemes and tones included in the reference pronunciation information may be obtained, for example, from the speech library. Since the pronunciations in the dictionary and the speech library are standard, the reference pronunciation information acquired through them is the standard pronunciation corresponding to the reference text, i.e., the standard pronunciation corresponding to the target voice data, so the pronunciation standard degree of the target voice data can be accurately evaluated from the reference pronunciation information of the reference text and the pronunciation information of the target voice data.
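As an illustration only, the dictionary lookup of standard pronunciations could be done with the open-source pypinyin package; the patent itself only requires "a dictionary and/or a speech library", so this dependency and the helper name are assumptions:

    # Hypothetical helper: look up the standard numeric-tone pinyin of a
    # reference text, e.g. "乘法" -> ["cheng2", "fa3"].
    from pypinyin import Style, pinyin

    def reference_pinyin(reference_text: str) -> list:
        return [syllable[0] for syllable in pinyin(reference_text, style=Style.TONE3)]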
The reference pronunciation information and the pronunciation information of the target voice data may be data of the same dimension; for example, both are pinyin, or both are phonemes. They may also be data of different dimensions; for example, the pronunciation information is pinyin and the reference pronunciation information is phonemes, in which case the phonemes of the pinyin can be further obtained so that the comparison is made in the same data dimension, thereby obtaining the pronunciation standard degree of the target voice data.
In one implementation, similarity calculation may be performed on the reference pronunciation information of the reference text and the pronunciation information of the target speech data, and the pronunciation standard degree of the target speech data may be determined based on their similarity. The pronunciation standard degree refers to how standard, or accurate, the pronunciation is; in the present application it may also be referred to as pronunciation accuracy.
With the voice data processing scheme provided by this embodiment, voice recognition processing can be performed on any voice data to obtain a voice recognition result containing text information and pronunciation information; a reference text can be determined based on posterior error correction processing of the text information and used as reference data for evaluating the pronunciation standard of the target voice data; and the pronunciation standard degree of the target voice data can be determined from the reference pronunciation information of the reference text and the recognized pronunciation information. The posterior error correction processing imitates the way the human brain automatically corrects an acquired text into the right text, so the reference text is obtained without being preset, any voice data can acquire its corresponding reference text, and recognition of the pronunciation standard degree of a piece of speech is not constrained by preset reference-text content. The method and the device are therefore suitable for accurately evaluating the pronunciation standard degree of a speaker's speech in any natural state, improving universality across scenes.
Referring to fig. 3, fig. 3 is a schematic flowchart diagram of a voice data processing method according to an embodiment of the present application. The voice data processing method may be performed by a computer device (e.g., the server 101 in fig. 1). The speech data processing method may include the following contents, which mainly refer to the detailed processing procedure of the posterior error correction processing stage, and may be used as a specific implementation step of S202 in the embodiment described in fig. 2.
In one embodiment, the text information comprises at least one text segment, obtained by sentence-breaking the text information. The text information may be a text sequence without punctuation marks; sentence-breaking refers to selecting pause positions in the text information. At least one text segment can be obtained by sentence-breaking the text information recognized from the target voice data. In one implementation, sentence-breaking can be performed according to semantics, dividing the text information into at least one text segment, where a text segment may be a single sentence or multiple sentences of text. In the posterior error correction processing of the text information, text segments can be processed as the minimum processing granularity, so as to obtain the reference text corresponding to the target voice data.
S301, performing semantic recognition processing on each text segment included in the text information of the target voice data to obtain semantic integrity of each text segment.
The semantic completeness of each text segment included in the text information can be obtained through semantic recognition processing. The semantic completeness of a text segment refers to the completeness of its language meaning and language structure, and can be used to indicate whether the segment's semantics are complete or incomplete. Complete semantics means that the language meaning and sentence structure accord with natural rules and conventional logic. For example, the text segment "the peanut flies in the sky" does not accord with convention and is a semantically incomplete text segment. Semantic completeness can be represented by a natural number; for example, a semantic completeness of 1 represents complete semantics and 0 represents incomplete semantics. Other characters may also be used.
In one implementation, a semantic integrity model may be used to perform semantic recognition processing on each text segment included in the text information to obtain the semantic completeness of each segment. The semantic integrity model is pre-trained, and a text segment is processed by the pre-trained model to determine whether its semantics are normal (i.e., whether they are complete). For example, for the text segment "this rice dumpling breaks the earth", the model outputs that the semantics are incomplete; for the text segment "this rice dumpling is really good to eat", the model outputs that the semantics are complete.
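A minimal sketch of this check, assuming some pre-trained binary sentence classifier; the patent does not fix the model architecture, and integrity_model.predict is a hypothetical interface:

    def is_semantically_complete(segment: str, integrity_model) -> bool:
        """True when the model judges both the meaning and the sentence
        structure of the segment to be complete (label 1), e.g.
        "this rice dumpling is really good to eat" -> True,
        "the peanut flies in the sky"              -> False."""
        return integrity_model.predict(segment) == 1  # 1 = complete, 0 = incomplete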
It can be understood that the semantic recognition processing for each text segment may be performed sequentially according to the output sequence of the speech recognition processing, for example, each time a text segment is output by the speech recognition model, the text segment is subjected to a posterior error correction processing, which includes inputting a semantic integrity model for semantic recognition processing, etc.; the semantic recognition processing may be performed on each text segment in parallel, for example, after the speech recognition model has processed the target speech data and obtained the complete text information, the posterior error correction processing may be performed on a plurality of text segments included in the text information collectively, regardless of the processing order.
S302, determining at least one reference text segment of the target voice data according to the semantic integrity of each text segment.
Assuming that the text information includes N text segments, M reference text segments can be determined according to the semantic completeness of the N text segments, where M is less than or equal to N and M, N are positive integers. The semantic completeness of a text segment indicates whether its semantics are complete, and reference text segments can be determined from the text segments according to this indication. Some text segments remain semantically incomplete even after error correction processing and cannot serve as reference text segments, so not every text segment included in the text information has a corresponding reference text segment, i.e., M may be less than N.
In one embodiment, based on the indication of the semantic completeness of each text segment, different processing strategies may be applied to semantically complete and incomplete segments. The specific implementation of S302 may include: for a target text segment included in the text information of the target voice data, if the semantic completeness of the target text segment indicates that its semantics are complete, determining the target text segment as the corresponding reference text segment; if the semantic completeness of the target text segment indicates that its semantics are incomplete, adjusting the target text segment and determining the corresponding reference text segment based on the adjusted target text segment.
The target text segment is any one of the at least one text segment included in the text information. If the target text segment is semantically complete, the error correction during speech recognition has made it standard enough to accurately express the meaning of the corresponding voice data, and it can be directly determined as the reference text segment; such a segment can be added to the semantically normal first text set A, and all text segments in set A can serve as reference text segments. If the target text segment is semantically incomplete, the text obtained through error correction during speech recognition is not standard enough to accurately express the meaning of the corresponding voice data; the target text segment is then first adjusted, the adjustment correcting the semantically incomplete segment to some extent, and the adjusted segment is used as the corresponding reference text segment once it meets the requirement. It can be understood that in this way every reference text segment comes from a semantically complete text segment, so the reference text generated from the reference text segments has reference confidence, ensuring the reliability of the reference data used in the pronunciation standard degree calculation. It should be noted that for each text segment included in the text information, the corresponding reference text segment may be determined using the first or second manner above according to the segment's semantic completeness, so as to obtain the reference text.
Thus, judging the semantic completeness of a recognized text segment at the semantic level reveals whether the segment accurately expresses the meaning of its corresponding voice data: a segment recognized as standard enough can be used directly as a reference text segment, while one that is not standard enough is adjusted to correct it and used as a reference text segment once the corrected version is standard enough. Either way, only sufficiently standard text segments become reference text segments, guaranteeing the standard degree of the reference text segments.
In one implementation, determining the reference text segment based on the adjusted target text segment includes: performing semantic recognition processing on the adjusted target text segment to obtain its semantic completeness, and then handling the adjusted segment with the same two strategies described above. Specifically: if the semantic completeness indicates that the semantics of the adjusted target text segment are complete, the adjusted target text segment is determined as a reference text segment and added to the semantically normal first text set A. Conversely, if the semantic completeness indicates that the semantics of the adjusted target text segment are incomplete, the target text segment has no corresponding reference text segment, and the adjusted segment may be added to the semantically abnormal second text set B; subsequently, the number of text segments in each set can be counted and play a corresponding role in the pronunciation standard degree calculation. Re-judging whether the adjusted target text segment is semantically complete makes it possible to obtain a corresponding reference text segment wherever possible, improving the reliability of the reference text.
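The overall control flow of this stage might look like the following sketch; adjust_segment stands in for the mask-and-predict adjustment described in the next steps, and each segment is assumed to carry its recognized text:

    def posterior_correction(segments, integrity_model):
        """Route each recognized segment as described above: complete
        segments go straight into set A; incomplete ones are adjusted and
        re-checked, and those still incomplete end up in set B."""
        set_a, set_b = [], []  # A: semantically normal, B: semantically abnormal
        for seg in segments:
            if is_semantically_complete(seg.text, integrity_model):
                set_a.append(seg)                  # already a reference segment
                continue
            adjusted = adjust_segment(seg)         # mask-and-predict, see below
            if is_semantically_complete(adjusted.text, integrity_model):
                set_a.append(adjusted)
            else:
                set_b.append(adjusted)             # counted later in scoring
        return set_a, set_b  # segments in A form the reference text (M = len(set_a))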
In one implementation, when the semantic meaning of the target text segment is incomplete, the target text segment is adjusted, including the following steps 1) to 3):
1) Perform mask processing on one original processing object in the target text fragment to obtain a processed target text fragment.
The target text segment here is a semantically incomplete text segment. An original processing object in the target text segment is a text processing object obtained by speech recognition and is the processing unit for adjusting the segment: if the target text segment is a sentence of Chinese text, the original processing object may be a character or a word; if it is a sentence of English text, the original processing object may be a word. Mask processing refers to blocking an original processing object, specifically by masking or random replacement, and may also be referred to as MASK processing. The processed target text segment obtained by mask processing is a text segment with one original processing object blocked, and the language representation model can then predict and restore the covered or replaced part. For example, covering one Chinese character of the target text segment with the token [mask] yields the processed target text segment, where [mask] replaces the original character to indicate that mask processing has been performed on it.
It should be noted that, for each original processing object in the target text fragment, the processing may be performed in the manners as described in 1) to 3), so as to adjust the target text fragment. For example, MASK processing may be performed sequentially for each word in a sentence of text, and the processing may be performed as described in steps 2) to 3) for each MASK-processed word.
2) Call a language representation model to perform prediction processing on the processed target text segment to obtain at least one candidate object at the mask position in the target text segment.
The language representation model may be a pre-trained BERT model. BERT pre-training is based on the masked LM pre-training task, i.e., randomly masking part of the input words and then predicting the masked words. The mask-processed target text segment can be input directly into the language representation model, which performs prediction processing and outputs an output vector corresponding to the mask position (the position of the blocked original processing object). The output vector can be a word vector; for example, inputting a text with one character covered into the BERT model yields the vector the model outputs for the covered character. At least one candidate object at the mask position can be derived from the output vector. In terms of model structure, a classification layer is added on top of the BERT encoder output, and the output vector passes through this classification layer. In terms of processing principle, a target vector is obtained from the output vector and a word embedding matrix (a parameter of the classification layer), and the target vector is then subjected to classification calculation to obtain at least one candidate object at the mask position. The word embedding matrix can be obtained through pre-training; the target vector is a vector of vocabulary dimension, obtained by multiplying the word embedding matrix with the output vector, i.e., the word embedding matrix converts the output vector into the dimension of the vocabulary. The classification calculation on the target vector may specifically use softmax (a normalization function) to compute the prediction probability of each object in the vocabulary (e.g., the probability of each character), and at least one candidate object at the mask position is then determined based on these probabilities. In one implementation, up to a preset number of objects with prediction probabilities greater than a probability threshold may be taken as the candidate objects; for example, the five characters with the highest prediction probabilities above the threshold are taken as candidate characters, and if fewer qualify, say only four, then only those four are used. A candidate object has the same granularity as the original processing object: if the original processing object is a character, the candidate objects are candidate characters; if it is a word, they are candidate words.
Illustratively, assume the target text segment is "huashenghui" and the mask-processed segment is "[mask]shenghui", where the original character corresponding to [mask] is the first character "hua". Each character of the mask-processed segment can be represented as a token vector (i.e., an input vector) and input into the BERT model, which outputs the word vector (i.e., output vector) corresponding to the position of the "hua" character. The output vector then passes through the classification layer, whose processing includes: multiplying the word vector with the word embedding matrix to obtain the target vector over the vocabulary, and finally computing via softmax the prediction probability of each character in the vocabulary being the character at the masked position. Candidate characters are then determined according to a preset screening rule, for example taking the five characters with the highest prediction probabilities.
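This prediction step can be reproduced with the HuggingFace transformers API; the public bert-base-chinese checkpoint and the Chinese rendering of the "[mask]shenghui" example are assumptions, since the patent only requires "a pre-trained BERT model":

    import torch
    from transformers import BertForMaskedLM, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertForMaskedLM.from_pretrained("bert-base-chinese")
    model.eval()

    def candidates_at_mask(masked_text: str, top_k: int = 5):
        """Return up to top_k (character, probability) pairs predicted for
        the [MASK] position, e.g. candidates_at_mask("[MASK]生会")."""
        inputs = tokenizer(masked_text, return_tensors="pt")
        mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
        with torch.no_grad():
            logits = model(**inputs).logits          # (1, seq_len, vocab_size)
        probs = logits[0, mask_pos].softmax(dim=-1)  # prediction probability per token
        top = probs.topk(top_k)
        return [(tokenizer.convert_ids_to_tokens(int(idx)), float(p))
                for p, idx in zip(top.values, top.indices)]

In practice the returned pairs would additionally be filtered against the probability threshold mentioned above before being used as candidate characters.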
3) Adjust the target text segment according to the at least one candidate object.
Using the predicted candidate objects, the target text segment may be adjusted; for example, the original processing object at the mask position in the target text segment is replaced by a qualified candidate object among the at least one candidate object. In one implementation, the specific implementation of step 3) may include the following steps (1) to (5):
(1) Obtain the prediction probability corresponding to each of the at least one candidate object.
After the prediction processing of the language representation model, at least one candidate object is obtained, each with its own prediction probability. The prediction probability can be obtained by the classification calculation on the aforementioned target vector and reflects the likelihood that the content at the mask position is that candidate object: the greater the prediction probability, the greater the likelihood.
(2) Take the candidate object with the highest prediction probability among the at least one candidate object as the first candidate object, and judge whether the first candidate object is the original processing object at the mask position.
The first candidate object is the candidate object with the largest prediction probability. Depending on whether it is the original processing object at the mask position, either step (3) below is performed, or: if the first candidate object is the original processing object at the mask position, this indicates that the original processing object is accurate and very likely the standard text of the original pronunciation at that position, so it does not need to be replaced; mask processing then moves on to the next original processing object, repeating the processing described in 1) to 3) above. That is, when the first candidate object is the same as the original object at the mask position, the other candidate objects need not be analyzed one by one, and the other original processing objects in the target text segment are processed directly.
(3) If the first candidate object is not the original processing object at the mask position, calculate the similarity between the pronunciation information of the first candidate object and the pronunciation information at the mask position.
The first candidate object is different from the original processing object at the mask position, which indicates that the original processing object may be an inaccurate object, that is, the original processing object is not a standard text corresponding to the original pronunciation at the mask position, and needs to be replaced. It is possible to calculate the similarity between the pronunciation information of the first candidate object and the pronunciation information at the mask position. The pronunciation information at the mask position corresponds to the original processing object at the mask position, and the pronunciation information at the mask position is obtained through voice recognition and is the original pronunciation information of the user. The pronunciation information of the first candidate object is standard pronunciation information acquired through one or more modes of a dictionary, a voice library, a dialect library and the like, and the candidate object similar to the pronunciation of the user can be found by comparing the similarity between the two kinds of pronunciation information, so that the reference text is determined.
In one embodiment, the pronunciation information is pinyin information, which includes an initial, a final, and a tone. Calculating the similarity between the pronunciation information of the first candidate object and the pronunciation information at the mask position includes: obtaining the confusion pairs in a dialect library; determining the initial similarity according to the confusion pairs, the initial in the pronunciation information of the candidate object, and the initial in the pronunciation information at the mask position; determining the final similarity according to the confusion pairs, the final in the pronunciation information of the candidate object, and the final in the pronunciation information at the mask position; determining the tone similarity between the tone in the pronunciation information of the candidate object and the tone in the pronunciation information at the mask position; and performing a weighted summation of the initial similarity, final similarity, and tone similarity to obtain the similarity between the pronunciation information of the first candidate object and the pronunciation information at the mask position.
This way of calculating the similarity comprehensively considers dialect pronunciation habits by incorporating the confusion pairs in the dialect library, analyzes the similarity of both initials and finals, and also takes tone similarity into account, making full use of everything the pinyin information contains (initials, finals, and tones). This can effectively improve the accuracy of the pinyin similarity, and by setting weights in the calculation, the different components of the pinyin information each play their corresponding role, achieving a scientific evaluation of the similarity.
The confusion pairs in the dialect library include initial confusion pairs and final confusion pairs; the initial similarity and final similarity of the pinyin information can be calculated by combining these confusion pairs, with pronunciations inside a confusion pair assigned a higher similarity and those outside a lower one. Specifically, it can be determined whether the initial in the pronunciation information of the candidate object and the initial in the pronunciation information at the mask position form an initial confusion pair; if so, the initial similarity is a first similarity, and if not, it is a second similarity, where the second similarity is smaller than the first. Since the dialect library includes common confusion pairs, such as [{s, sh}, {f, h}, {en, eng}, {n, l} ...], pinyin pairs in the confusion pairs may all be set to the first similarity, for example 0.8, while pinyin not in the confusion pairs may use the second similarity, which can specifically be the ratio distance in the Levenshtein distance, computed from the actual difference. The final similarity follows the same principle: comparing the finals in the pronunciation information against the final confusion pairs yields the final similarity. For example, for the candidate word "multiplication" (乘法), the standard pronunciation of the text is "cheng2 fa3" and the user's pronunciation is "shen2 fa2". For the first character, comparing "cheng2" and "shen2": the initials (ch and sh) are not in a confusion pair, so the initial similarity is 0.5; the finals (eng and en) are in a confusion pair, so the final similarity is 0.8.
In pinyin information, the tones are the first, second, third, fourth, and neutral tones. Since the second and third tones sound similar, the similarity between them can be preset to one value and that between other tone pairs to another. In one possible implementation, the tone similarity may be looked up from preset tone similarities according to the tone in the pronunciation information of the candidate object and the tone of the pronunciation information at the mask position. The preset tone similarities include a first tone similarity and a second tone similarity, for example: "12": 0, "13": 0, "14": 0, "23": 0.5, "24": 0, "34": 0, where "12": 0 means the similarity between the first and second tones is 0, and "23": 0.5 means the similarity between the second and third tones is 0.5; the remaining entries are read similarly. Here the first tone similarity is 0 and the second tone similarity is 0.5. Naturally, if the tones of two pieces of pinyin information are identical, the tone similarity is 1. In the pronunciation of the word "multiplication" mentioned above, the tones of "cheng2" and "shen2" in the pinyin of the first character are both 2, so the tone similarity is 1.
The initial similarity, final similarity, and tone similarity can each be given a corresponding weight, and a weighted summation based on these weights yields the similarity between the pronunciation information of the first candidate object and the pronunciation information at the mask position, i.e., the pinyin similarity. For example, with the initial weight set to 0.4, the final weight to 0.4, and the tone weight to 0.2, and following the preceding example where the candidate is "multiplication" and the user's original pronunciation is "shen2 fa2": the pinyin similarity of the first character is 0.5 × 0.4 + 0.8 × 0.4 + 1 × 0.2 = 0.72. Similarly, for the second character, comparing "fa2" with "fa3": the initial similarity is 1, the final similarity is 1, and the tone similarity is 0.5, so the pinyin similarity is 1 × 0.4 + 1 × 0.4 + 0.5 × 0.2 = 0.9.
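A self-contained sketch of this weighted pinyin similarity follows. The confusion pairs, tone table, and weights are the sample values given in the text; the initial table, the use of difflib's ratio as the "ratio distance", and the function names are assumptions:

    import difflib

    INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
                "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]
    CONFUSION_PAIRS = [{"s", "sh"}, {"f", "h"}, {"en", "eng"}, {"n", "l"}]
    TONE_SIMILARITY = {frozenset("23"): 0.5}  # second vs third tone; other differing pairs -> 0
    W_INITIAL, W_FINAL, W_TONE = 0.4, 0.4, 0.2

    def split_pinyin(py: str):
        """'cheng2' -> ('ch', 'eng', '2'); zero-initial syllables get ''."""
        body, tone = py[:-1], py[-1]
        for ini in INITIALS:  # two-letter initials are listed first
            if body.startswith(ini):
                return ini, body[len(ini):], tone
        return "", body, tone

    def part_similarity(a: str, b: str) -> float:
        if a == b:
            return 1.0
        if {a, b} in CONFUSION_PAIRS:
            return 0.8  # confusable pair recorded in the dialect library
        return difflib.SequenceMatcher(None, a, b).ratio()  # stand-in for the Levenshtein ratio

    def pinyin_similarity(reference: str, spoken: str) -> float:
        ri, rf, rt = split_pinyin(reference)
        si, sf, st = split_pinyin(spoken)
        tone = 1.0 if rt == st else TONE_SIMILARITY.get(frozenset(rt + st), 0.0)
        return (W_INITIAL * part_similarity(ri, si)
                + W_FINAL * part_similarity(rf, sf)
                + W_TONE * tone)

    # Reproduces the worked example (乘法 read as "shen2 fa2"):
    # pinyin_similarity("cheng2", "shen2") == 0.5*0.4 + 0.8*0.4 + 1*0.2 == 0.72
    # pinyin_similarity("fa3", "fa2")      == 1*0.4 + 1*0.4 + 0.5*0.2 == 0.9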
In another embodiment, the dialect library also records the words of common mainstream dialects and their corresponding pronunciations, such as [ { flower, fa1 }, { flower, ho1 }, { meat, ru2 } … ], so that the dialect pronunciation information recorded in the dialect library can be used when calculating the similarity. Pronunciation information here covers both dialect pronunciation information, i.e., pronunciations unique to a locality, and standard pronunciation information, such as standard Mandarin pinyin. The similarity between the pronunciation information of the first candidate object and the pronunciation information at the mask position includes a dialect similarity between the dialect pronunciation information of the first candidate object and the pronunciation information at the mask position, and/or a standard similarity between the standard pronunciation information of the first candidate object and the pronunciation information at the mask position. That is, the similarity may include only the dialect similarity, only the standard similarity, or both.
By introducing the dialect pronunciation information in the dialect library when calculating the similarity between the pronunciation information of the first candidate object and the original pronunciation at the mask position, the difference between the user's original pronunciation information and the standard pronunciation information can be compared through that similarity. Even when the user's original pronunciation is a dialect pronunciation rather than the standard one, it can still be recognized, and the candidate object can replace the corresponding original processing object at the mask position. Covering comparisons across different pronunciation types (standard pronunciations and dialect pronunciations) effectively improves the accuracy of pronunciation recognition, so the original standard text at the mask position can be determined more accurately.
The standard similarity between the standard pronunciation information of the first candidate object and the pronunciation information at the mask position may be calculated in the manner described above in combination with the confusion pairs in the dialect library, or may be calculated directly (for example, using the ratio distance of the Levenshtein distance). Taking pinyin information as an example, suppose the first candidate is the word "flower": the user's original pronunciation is "fa1" and the word's standard pinyin is "hua1". The pinyin similarity between "hua1" and "fa1" is then: 0 × 0.4 + 0.67 × 0.4 + 1 × 0.2 ≈ 0.47 (initial similarity 0, final similarity 0.67, tone similarity 1). The dialect pronunciation information of the first candidate object can be looked up in the dialect library, and the dialect similarity between that dialect pronunciation information and the pronunciation information at the mask position can likewise be calculated directly or in combination with the confusion pairs. For example, traversing the dialect pronunciations of the word "flower" in the dialect library, one dialect pinyin of "flower" is "fa1", identical to the pronunciation "fa1" obtained by speech recognition, so the dialect similarity is 1.
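Below is a minimal sketch of the dialect-similarity lookup just described. The structure of the dialect library (a mapping from a word to its known dialect pinyins) and the exact-match scoring are illustrative assumptions; pinyin_similarity() from the earlier sketch could be substituted for a graded comparison.

```python
# Hedged sketch: best similarity against any recorded dialect pronunciation.
DIALECT_LIBRARY = {
    "flower": ["fa1", "ho1"],  # assumed dialect pronunciations of the word
    "meat": ["ru2"],
}


def dialect_similarity(word: str, heard_pinyin: str) -> float:
    """Traverse the word's dialect pinyins and keep the best match."""
    best = 0.0
    for dialect_pinyin in DIALECT_LIBRARY.get(word, []):
        best = max(best, 1.0 if dialect_pinyin == heard_pinyin else 0.0)
    return best

print(dialect_similarity("flower", "fa1"))  # 1.0
```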
Further, the similarity between the pronunciation information of the first candidate object and the pronunciation information at the mask position may be compared with a preset similarity threshold, and one of the two branches described below is executed depending on how the similarity relates to that threshold.
If the similarity is smaller than the preset similarity threshold, taking a second candidate object in the at least one candidate object as a first candidate object, and executing the step of judging whether the first candidate object is the original processing object at the mask position until the first candidate object is the candidate object with the minimum prediction probability in the at least one candidate object.
When the similarity between the pronunciation information of the first candidate object and the pronunciation information at the mask position is smaller than the preset similarity threshold, the first candidate object and the original text content the user expressed at the mask position are not close: the probability that the first candidate object is the standard text corresponding to the original pronunciation at the mask position is low, so it cannot replace the original processing object at the mask position. For example, assuming the preset similarity threshold is 0.8, the similarity between the standard pinyin of the candidate word "flower" and the user's original pronunciation is 0.47, which is smaller than the threshold, so no replacement is made. At this point a second candidate object may be determined from the at least one candidate object, namely the candidate with the highest prediction probability among those whose prediction probability is lower than that of the first candidate object. If the prediction probabilities of the at least one candidate object are arranged in descending order, the first candidate object is the one ranked first (highest prediction probability) and the second candidate object is the one ranked second, i.e., the candidate with the highest prediction probability among the remaining candidates. For example, if there are 5 candidate objects and 4 of them have a lower prediction probability than the first candidate object, the second candidate object is the one with the highest prediction probability among those 4.
If the similarity between the pronunciation information of the first candidate object and the pronunciation information at the mask position is smaller than the preset similarity threshold, the above steps are executed in a loop: a second candidate object is determined from the at least one candidate object and taken as the new first candidate object, the second candidate object being the candidate with the highest prediction probability among those whose prediction probability is lower than that of the current first candidate object (i.e., the most recently examined candidate). The step of judging whether the first candidate object is the original processing object at the mask position is then executed again, until the first candidate object is the candidate object with the smallest prediction probability.
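The loop can be sketched as follows, reusing pinyin_similarity() from the earlier sketch. Candidates are tried in descending prediction probability; a candidate replaces the original processing object only when it differs from it and its pronunciation similarity reaches the threshold. All names, the 0.8 threshold, and the behavior when the prediction confirms the original object are illustrative assumptions.

```python
# Hedged sketch of the candidate loop for one masked position.
SIMILARITY_THRESHOLD = 0.8  # assumed preset similarity threshold


def correct_masked_position(candidates, original_char, original_pinyin, pinyin_of):
    """candidates: list of (char, prediction_probability) pairs;
    pinyin_of: callable mapping a char to its standard (initial, final, tone).
    Returns (chosen_char, marked)."""
    for char, _prob in sorted(candidates, key=lambda c: c[1], reverse=True):
        if char == original_char:
            return char, False   # prediction agrees with the original object
        sim = pinyin_similarity(pinyin_of(char), original_pinyin)
        if sim >= SIMILARITY_THRESHOLD:
            return char, False   # replace the original processing object
    return original_char, True   # no candidate qualified: mark the object
```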
If the similarity between the pronunciation information of the candidate object with the smallest prediction probability and the pronunciation information at the mask position is still smaller than the preset similarity threshold, the original processing object at the mask position in the target text segment is marked. At this point every candidate object at the mask position has been examined in turn without replacing the original processing object, which indicates that the user's pronunciation of the original processing object is highly inaccurate and the corresponding correct text cannot be obtained through posterior error correction. By marking the original processing object at the mask position in the target text segment, for example marking the word within the sentence text, the original processing object can later be output to prompt the user. A marked target text segment can be added directly to the second text set. Masking then proceeds to the next original processing object, i.e., steps 1) to 3) are repeated.
And if the similarity is greater than or equal to the preset similarity threshold, replacing the original processing object at the mask position with the first candidate object.
When the similarity between the pronunciation information of the first candidate object and the pronunciation information at the mask position is greater than or equal to the preset similarity threshold, the first candidate object is very close to the original text content the user expressed at the mask position: the probability that it is the standard text corresponding to the original pronunciation at the mask position is high, and the original processing object at the mask position can be replaced directly with the first candidate object, thereby adjusting the target text segment. For example, if the similarity between the candidate word "hua" (flower), computed from its standard pronunciation together with the dialect pinyin information, and the recognized pronunciation is 1, which exceeds the preset similarity threshold, the original processing object can be replaced by that "hua" character. It is understood that once the original processing object at the mask position has been replaced, the next original processing object in the target text segment can be processed directly, without analyzing the remaining candidates, i.e., the contents of 1) to 3) are executed again.
For example, when a speaker utters a piece of speech, the speech data is collected and the speech recognition result is as follows. The text result (i.e., text information) is: "Huashengyan is good and has a special taste." The pinyin result (i.e., pronunciation information) is: "hua2 sheng1 zheng1 hao3, you3 yi4 zhong3 te4 bie2 de5 wei4 dao4." (1: first tone, 2: second tone, 3: third tone, 4: fourth tone, 5: neutral tone.)
During speech recognition, the recognition process itself may apply a certain error-correction mechanism, so that the text corresponding to "zheng1 hao3" in the pinyin result has already been corrected to "true".
In the posterior error-correction stage, when the MASK operation is performed on the mis-recognized "hua" character, the character with the largest probability in the output vector at the MASK position is the "hua" (flower) character. Following the similarity calculation described above, the pinyin similarity between the speaker's pronunciation and the "hua" (flower) character is 1 × 0.4 + 1 × 0.4 + 0 × 0.2 = 0.8, which equals the preset similarity threshold of 0.8, so the original character can be replaced by the "hua" (flower) character.
It should be noted that each original processing object in the target text segment may be processed according to the contents described in 1) to 3). For each masked original processing object, at least one candidate object at the mask position is determined, and the original processing object is replaced when a candidate object satisfies the replacement condition (i.e., the candidate object is not the original processing object and the similarity is greater than or equal to the preset similarity threshold). For example, predicting a sentence whose first character is occluded with the language representation model may yield 4 candidate characters for the first character position; when any of them satisfies the replacement condition, the occluded first character in the target text segment is replaced with that candidate character. Predicting the sentence with the second character occluded may yield 3 candidate characters for the second character position, and the occluded second character can likewise be replaced with a qualifying candidate. If no candidate object satisfies the replacement condition, the corresponding original processing object is marked. This realizes the posterior error correction of the target text segment and produces the adjusted target text segment. The semantic integrity of the adjusted target text segment can then be judged again: if its semantics are complete, it can be added to the first text set and used as a reference text segment; otherwise it can be added to the second text set. The pronunciation standard degree is subsequently calculated using the text segments in the first text set, which includes both original text segments and adjusted text segments.
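Candidate generation with a masked language representation model can be sketched as below. The use of Hugging Face's fill-mask pipeline and the "bert-base-chinese" checkpoint is an illustrative assumption; the embodiment only requires some language representation model that predicts the masked position and returns candidates with prediction probabilities.

```python
# Hedged sketch: obtaining (candidate, prediction probability) pairs at a mask.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-chinese")  # assumed model

sentence = "花生[MASK]好，有一种特别的味道。"  # one original processing object masked
candidates = [
    (pred["token_str"], pred["score"])  # (candidate character, prediction prob.)
    for pred in fill_mask(sentence, top_k=4)
]
print(candidates)  # e.g. [("真", 0.62), ...] — illustrative output for step 2)
```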
In summary, during the posterior error correction, the recognized text information can be corrected using semantic-level information, with different processing schemes for the semantically complete and incomplete cases: a semantically complete text segment can be used directly as a reference text segment, while a semantically incomplete text segment can be corrected in combination with the data in the dialect library (including dialect pronunciation information, confusion pairs, and the like), so as to comprehensively evaluate the speaker's pronunciation standard degree. This posterior error-correction mechanism imitates the error-correction mechanism of the human brain: after a speaker utters a passage, the original text and pinyin of the passage, and whether the pronunciation is standard, are identified automatically.
After each text segment included in the text information has been processed as described above, at least one reference text segment is obtained. Depending on how the segments were processed, every text segment may have a corresponding reference text segment, or only some of them may. The reference text segments are used to generate the reference text, as detailed in the introduction of S303.
S303, generating a reference text corresponding to the target voice data according to the at least one reference text segment.
The at least one reference text segment is semantically complete. If there is exactly one reference text segment, it can be used directly as the reference text corresponding to the target speech data; if there are several, the reference text can be generated from them. In one implementation, the at least one reference text segment is combined according to the order of the corresponding text segments in the text information to obtain the reference text corresponding to the target speech data; the character order of each reference text segment within the reference text is the same as in the recognized text information. For example, if the text information is "Huashenghui is good and has a special taste.", the reference text is "Peanut is really good and has a special taste.".
The data processing scheme provided by this embodiment of the application can determine, and correct, the speech recognition result of any piece of speech before evaluating its text information and pronunciation information. Specifically, in the posterior error-correction stage, each text segment contained in the text information can be judged at the semantic level to determine its recognition quality, i.e., whether its semantics are complete or incomplete, and different processing strategies can be selected accordingly to obtain the reference text. Judging semantic integrity directly reveals whether a text segment accurately expresses the meaning of the corresponding speech data: a segment that is accurate enough is taken directly as a reference text segment, while an inaccurate segment is first adjusted and taken as a reference text segment only once it is accurate enough. This strategy imitates the error-correction mechanism of the human brain, correcting automatically as soon as the text information is obtained, so the reference text can be produced in real time; and because the reference text is generated from accurate reference text segments, its standard degree is effectively guaranteed. No separate text reference library needs to be prepared, which also facilitates accurate evaluation of the pronunciation standard degree in a variety of scenarios.
Referring to fig. 4, fig. 4 is a third schematic flowchart of a voice data processing method according to an embodiment of the present application. The voice data processing method may be performed by a computer device (e.g., the server 101 in fig. 1). The voice data processing method may include the following.
S501, performing voice recognition processing on the target voice data to obtain a voice recognition result of the target voice data, wherein the voice recognition result comprises text information and pronunciation information.
S502, posterior error correction processing is carried out on the text information to obtain a reference text corresponding to the target voice data.
The above steps S501 and S502 can refer to the content introduced in the corresponding embodiment of fig. 2 and the corresponding embodiment of fig. 3, and are not repeated herein. The pronunciation criterion of the target speech data may be determined based on the reference pronunciation information of the reference text and the pronunciation information of the target speech data, and the specific determination may be found in S503 and S504 described below.
S503, determining a target similarity between the reference pronunciation information of the reference text and the pronunciation information of the target speech data.
After the reference text corresponding to the target speech data is obtained, the reference pronunciation information of the reference text can first be acquired; the reference pronunciation information is the standard pronunciation information of the reference text. It can be obtained through a dictionary and/or a speech library: both contain characters (or words) and their corresponding pronunciations, and the speech library additionally includes standard acoustic models of the basic pronunciation phonemes (such as the basic phonemes of Chinese pinyin).
In one embodiment, the pronunciation information of the target speech data is the speech pinyin obtained through speech recognition, and the reference pronunciation information of the reference text includes the text pinyin of every semantically complete text segment, obtained through the dictionary. The target similarity may be the pinyin similarity between the speech pinyin and the text pinyin, and in one implementation may be computed comprehensively from phonemes and tones. Each reference phoneme and reference tone of the text pinyin can be obtained through the speech library (a reference phoneme or reference tone being the standard phoneme or tone), while the phonemes and tones of the speech pinyin can be extracted for the similarity calculation. An implementation of S503 may thus include: determining the phoneme similarity between the phonemes in the pronunciation information of each processing object in the text information and the reference phonemes in the reference pronunciation information of each processing object in the reference text; determining the tone similarity between the tones in the pronunciation information of each processing object in the text information and the reference tones in the reference pronunciation information of each processing object in the reference text; and determining the target similarity from the phoneme similarity and the tone similarity.
A processing object is the unit for which similarity is calculated; it may be a character, a word, or a text segment/reference text segment. Text segments in the text information correspond one-to-one with reference text segments in the reference text, and the phoneme similarity and tone similarity are computed between corresponding processing objects, for example between the phonemes of the pinyin of the first character in a text segment and the phonemes of the standard pinyin of the first character in its corresponding reference text segment. The phoneme similarity and the tone similarity can be weighted and summed, specifically by average weighting, to obtain the target similarity.
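A minimal sketch of this target-similarity computation under the average-weighting choice is shown below. The per-object phoneme and tone similarities stand for any measures computed as described earlier (e.g., the ratio distance); the function name and default weights are illustrative assumptions.

```python
# Hedged sketch of S503's target similarity for one processing object.
def target_similarity(phoneme_sim: float, tone_sim: float,
                      w_phoneme: float = 0.5, w_tone: float = 0.5) -> float:
    """Weighted sum of phoneme and tone similarity; 0.5/0.5 = average weighting."""
    return w_phoneme * phoneme_sim + w_tone * tone_sim

# e.g. phonemes identical (1.0) but tones differ (0.0):
print(target_similarity(1.0, 0.0))  # 0.5
```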
It should be noted that, since some text segments may not have a corresponding reference text segment, the calculation of the phoneme similarity, tone similarity, and so on is based on the reference text segments. If a text segment has no corresponding reference text segment, its target similarity may be directly counted as 0 or ignored.
In this scheme, similarity is computed between the standard phonemes and tones of the text pinyin and the phonemes and tones of the corresponding speech pinyin, so that the pinyin information is accounted for comprehensively and the target similarity between the reference pronunciation information and the recognized pronunciation information can be calculated accurately.
In one implementation, the computed target similarity may be converted into a score for the corresponding processing object, from which the score of the target speech data, and hence the pronunciation standard degree, is determined. For example, the target similarity is taken directly as the score of the processing object, or is taken as the score only when it exceeds a threshold; see S504 below for the specific implementation.
S504, determining the pronunciation standard degree of the target voice data according to the pronunciation evaluation grade standard and the target similarity.
The pronunciation assessment level criteria may be used to assess the pronunciation level of the target speech data. The target similarity may include a pronunciation similarity between pronunciation information corresponding to a unit processing object (e.g., each word or word) in the recognized text information and reference pronunciation information of the reference text.
In one implementation, S504 includes: if the target similarity is larger than or equal to the similarity threshold required by the pronunciation evaluation grade standard, taking the target similarity as the score of the target voice data corresponding to the processing object; determining pronunciation scores of text segments corresponding to the target voice data based on the scores of the processing objects corresponding to the target voice data; and determining the pronunciation standard degree of the target voice data based on the pronunciation scores of the text segments, the ratio of the number of the reference text segments included in the reference text to the number of the text segments included in the text information.
Different pronunciation assessment level criteria measure the pronunciation accuracy of the target speech data differently on the basis of the target similarity: the similarity thresholds they require differ, and so does the pronunciation standard degree finally determined for the target speech data.
Optionally, the pronunciation assessment level criteria are divided into three levels, high, medium, and low, each requiring a different similarity threshold. The high-level criterion requires a similarity threshold of 100%: the pronunciation information of the target speech data must match the reference pronunciation information of the reference text exactly to score, otherwise the score is 0; for example, if the pinyin of a character does not match, that character scores 0. The similarity threshold required by the medium-level criterion may be a value greater than 0, such as 60%. If the target similarity exceeds the threshold, the corresponding processing object is pronounced accurately and the target similarity may be taken as its score; otherwise the pronunciation is inaccurate and the score may be set to 0. For example, if the target similarity of a character's pronunciation exceeds the threshold, the similarity is retained as the character's score, and if it falls below the threshold, the character scores 0. The low-level criterion requires a threshold of 0, so the computed target similarity can be used directly as the score of the corresponding processing object; for example, if the target similarity between a recognized pinyin and the corresponding reference pinyin in the reference text is 0.5, the character corresponding to that pinyin scores 0.5.
Offering several pronunciation assessment level criteria provides different evaluation standards. Under whichever criterion is adopted, the scores of all processing objects of each text segment in the text information are obtained first; the score of each text segment is then computed by weighting the scores of its processing objects, specifically with an average-weighting strategy (equivalent to taking the mean), although other strategies, such as weighting wrong characters more heavily, may also be adopted. The score of the whole text information is then obtained by weighting the scores of the text segments, again, for example, by average weighting. Further, the score of the whole text information may be combined with the ratio of the number of reference text segments in the reference text to the number of text segments in the text information to obtain the final score. For example, if the reference text contains 9 reference text segments, the text information contains 10 text segments, and the score computed as above is 0.8, the pronunciation standard degree is 0.8 × 0.9 + 0 × 0.1 = 0.72. In another implementation, the score of the whole text information may itself serve as the pronunciation standard degree of the target speech data, without involving this ratio; for example, the text-information score of 0.8 can be used directly as the pronunciation standard degree of the target speech data.
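The level-based scoring and layer-by-layer aggregation can be sketched as follows. The three thresholds (1.0 / 0.6 / 0.0) follow the example levels above; segment and text scores use average weighting; all names are illustrative assumptions.

```python
# Hedged sketch of S504: per-object scoring and aggregation to a final degree.
from statistics import mean

LEVEL_THRESHOLDS = {"high": 1.0, "medium": 0.6, "low": 0.0}  # assumed levels


def object_score(target_similarity: float, level: str = "medium") -> float:
    """Score of one processing object under the chosen assessment level."""
    return target_similarity if target_similarity >= LEVEL_THRESHOLDS[level] else 0.0


def pronunciation_standard(segment_similarities, n_reference, n_segments,
                           level="medium"):
    """segment_similarities: per-segment lists of per-object target similarities."""
    segment_scores = [mean(object_score(s, level) for s in seg)
                      for seg in segment_similarities]
    text_score = mean(segment_scores)         # average weighting across segments
    ratio = n_reference / n_segments          # reference segments vs. all segments
    return text_score * ratio                 # e.g. 0.8 * 0.9 = 0.72
```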
Thus, by evaluating the pronunciation standard degree through each processing object's target similarity and the similarity threshold required by the assessment level criterion, the original pronunciation information and the standard pronunciation information of the target speech data can be compared under criteria of varying strictness, so that whether the original pronunciation of each unit processing object in the text information is up to standard can be judged from its score, and the evaluation result can be presented to the user in finer detail. The pronunciation standard degree of the whole target speech data is computed and aggregated layer by layer, from the smallest processing object to the text segment and then from the text segment to the text information, so the evaluation is more detailed and the final value more accurate. On this basis, introducing the ratio of the number of reference text segments to the number of text segments as a weight in the score improves the reliability of the evaluation of the pronunciation standard degree.
In one implementation, if the target similarity is less than the similarity threshold required by the pronunciation assessment level criterion, the processing object corresponding to that target similarity is marked and its score is set to zero. For example, a character may be marked, the mark indicating where the pronunciation is inaccurate. A target similarity below the required threshold shows that the corresponding processing object is mispronounced; marking it makes it convenient to output the result to the user as a display prompt.
In one embodiment, the data processing result and the evaluation information of the target speech data may also be output. The data processing result includes one or more results of the speech recognition processing and the posterior error correction processing performed on the target speech data; specifically, it includes at least one of the following: the text information, the pronunciation information, and the reference text.
The evaluation information refers to a data result for evaluating the pronunciation of the target voice data, and specifically, the evaluation information includes at least one of the following: pronunciation standard degree, pronunciation score of each text segment and marking prompt information. The marking prompt information is used for prompting a marked processing object in the text information; the marked processing object includes at least one of: the processing object with the target similarity smaller than the similarity threshold required by the pronunciation evaluation grade standard and the marked original processing object in the posterior error correction processing process.
The marked processing object is a place which is automatically recognized and has inaccurate pronunciation, for example, a word with the wrong pronunciation in a sentence, and can be displayed by prompting through output, so that a speaker is helped to better know the character or the word with the inaccurate pronunciation. In addition, the pronunciation standard degree (e.g., total score of speaker) of the target voice data and the pronunciation score (e.g., score per sentence) of the text segment may be output for prompt display.
Illustratively, the speech recognition text result for the target speech data is: "Huashengyan is good and has a special taste", and the pinyin result of the speech recognition is: "hua2 sheng1 zheng1 hao3, you3 yi4 zhong3 te4 bie2 de5 wei4 dao4";

the text result after the posterior error correction processing (i.e., the reference text) is: "Peanut is really good and has a special taste", and the (standard) character-to-pinyin result after the posterior error correction processing is: "hua1 sheng1 zhen1 hao3, you3 yi4 zhong3 te4 bie2 de5 wei4 dao4";
the score of each character's pinyin is then calculated in turn, scoring the similarity according to the ratio definition of the Levenshtein distance (i.e., computing the target similarity). The formula for ratio is: (sum − class edit distance) / sum, where the class edit distance is the cost of turning one string into the other through edit operations, a deletion or insertion adding 1 to the distance but a replacement adding 2, and sum is the total length of the two strings. The per-character scores by ratio distance are then: [0.75, 1.0, 0.91, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], for a total score of 0.97; the printed prompt is shown in fig. 5. It should be noted that the similarity score may instead adopt the strategy of scoring 1 for identical pinyin and 0 for different pinyin; this can be set as required and is not limited here.
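The per-character ratio scores above can be reproduced with the sketch below, reusing ratio_distance() from the earlier sketch (deletion/insertion cost 1, replacement cost 2). The list literals simply transcribe the example's recognized and reference pinyin.

```python
# Hedged sketch: reproducing the example's per-character ratio scores.
recognized = ["hua2", "sheng1", "zheng1", "hao3", "you3", "yi4",
              "zhong3", "te4", "bie2", "de5", "wei4", "dao4"]
reference  = ["hua1", "sheng1", "zhen1", "hao3", "you3", "yi4",
              "zhong3", "te4", "bie2", "de5", "wei4", "dao4"]

scores = [round(ratio_distance(r, s), 2) for r, s in zip(reference, recognized)]
print(scores)                               # [0.75, 1.0, 0.91, 1.0, ...]
print(round(sum(scores) / len(scores), 2))  # 0.97
```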
Visualizing the data processing result and the evaluation information of the target speech data provides the user with intuitive information: the user can conveniently see exactly where the pronunciation was inaccurate, the overall pronunciation level, and so on. In scenarios such as spoken-language practice evaluation and Mandarin practice evaluation, the recognized pronunciation content and the evaluation content can be displayed to the user in detail, helping the user identify problems and correct pronunciation.
In summary, the flow of evaluating the pronunciation standard degree and outputting prompts can be illustrated by the flowchart in fig. 6. For any passage of speech in a specified language, such as Chinese, a speech recognition result including text and pinyin is obtained through speech recognition processing; then, taking each sentence as the processing granularity, posterior error correction is performed on each sentence text to obtain the reference text. The standard pinyin of the reference text can then be obtained through the dictionary and the speech library and compared with the pinyin recognized from speech: the phoneme and tone features of each character's pinyin in the recognized text are extracted and compared with the features (phonemes and tones) of the standard pinyin to obtain each character's score, from which the score of each sentence and then of the whole text is obtained. Characters whose score falls below the similarity threshold, as well as characters identified during the posterior error correction as highly inaccurate (i.e., ones no candidate object could replace), can be marked and displayed as prompts.
It should be noted that this scheme can be used to judge the pronunciation standard degree of any language, and the calculation strategy for the evaluation score (i.e., the pronunciation standard degree) and the related weights can be modified as required. In addition, if the pronunciation information obtained by speech recognition is phoneme information, then when calculating the target similarity of a single processing object, the phoneme similarity of the processing object may be calculated specifically, the processing object possibly being a word.
The speech data processing scheme provided by the embodiments of the application requires no reference text in advance: imitating the error-correction mechanism of the human brain, it automatically performs posterior error correction on the acquired text information to obtain the reference text, with no separately prepared text reference library, and can thus judge the pronunciation standard degree of natural speech, making it applicable to scenarios such as interviews and quality-control customer service. When the reference text is used to evaluate the pronunciation standard degree of the target speech data, combining the pronunciation assessment level criteria allows the standard degree to be determined accurately, and outputting the data processing result and the evaluation information provides ratings and prompts, so the speaker can see the speech-standard situation intuitively.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a voice data processing apparatus according to an embodiment of the present application. The voice data processing apparatus may be a computer program (including program code) running on a computer device, for example, the voice data processing apparatus is an application software; the voice data processing device can be used for executing corresponding steps in the method provided by the embodiment of the application. As shown in fig. 7, the voice data processing apparatus 800 may include at least one of: a processing module 801 and a determination module 802.
The processing module 801 is configured to perform voice recognition processing on the target voice data to obtain a voice recognition result of the target voice data, where the voice recognition result includes text information and pronunciation information;
the processing module 801 is configured to perform posterior error correction processing on the text information to obtain a reference text corresponding to the target voice data;
the determining module 802 is configured to determine the pronunciation normalization of the target voice data according to the reference pronunciation information of the reference text and the pronunciation information of the target voice data.
In one embodiment, the text information comprises at least one text segment, and the at least one text segment is obtained based on text sentence break of the text information; the processing module 801 is specifically configured to: performing semantic recognition processing on each text segment included in the text information of the target voice data to obtain the semantic integrity of each text segment; determining at least one reference text segment of the target voice data according to the semantic integrity of each text segment; and generating a reference text corresponding to the target voice data according to the at least one reference text segment.
In an embodiment, the processing module 801 is specifically configured to: aiming at a target text fragment included in text information of target voice data, if the semantic integrity of the target text fragment indicates that the semantic integrity of the target text fragment is complete, determining the target text fragment as a reference text fragment, wherein the target text fragment is any one of at least one text fragment included in the text information; and if the semantic completeness of the target text segment indicates that the semantic of the target text segment is incomplete, adjusting the target text segment, and determining a reference text segment corresponding to the target text segment based on the adjusted target text segment.
In an embodiment, the processing module 801 is specifically configured to: masking any original processing object in the target text fragment to obtain a processed target text fragment; calling a language representation model to perform prediction processing on the processed target text segment to obtain at least one candidate object at the mask position in the target text segment; and adjusting the target text segment according to the at least one candidate object.
In an embodiment, the processing module 801 is specifically configured to: obtaining a prediction probability corresponding to each of the at least one candidate object, wherein the prediction probability is used for reflecting the possibility that the content at the mask position is the candidate object; taking a candidate object with the highest prediction probability in at least one candidate object as a first candidate object, and judging whether the first candidate object is an original processing object at a mask position; if the first candidate object is not the original processing object at the mask position, calculating the similarity between the pronunciation information of the first candidate object and the pronunciation information at the mask position; if the similarity is smaller than a preset similarity threshold, taking a second candidate object in the at least one candidate object as a first candidate object, and executing a step of judging whether the first candidate object is an original processing object at a mask position until the first candidate object is a candidate object with the smallest prediction probability in the at least one candidate object, wherein the second candidate object is a candidate object with the largest prediction probability in the candidate objects with the smaller prediction probability than the first candidate object; and if the similarity is greater than or equal to a preset similarity threshold, replacing the original processing object at the mask position by using the first candidate object.
In one embodiment, the pronunciation information is pinyin information, and the pinyin information comprises initials, finals and tones; the processing module 801 is specifically configured to: obtaining confusion pairs in a dialect library; determining initial similarity according to the initial consonants in the pronunciation information of the confusion pair and the candidate object and the initial consonants in the pronunciation information at the mask position, and determining final similarity according to the final consonants in the pronunciation information of the confusion pair and the candidate object and the final consonants in the pronunciation information at the mask position; determining a tone similarity between a tone in the pronunciation information of the candidate object and a tone in the pronunciation information at the mask position; and carrying out weighted summation on the initial similarity, the final similarity and the tone similarity to obtain the similarity between the pronunciation information of the first candidate object and the pronunciation information at the mask position.
In an embodiment, the determining module 802 is specifically configured to: determining a target similarity between reference pronunciation information of the reference text and pronunciation information of the target voice data; and determining the pronunciation standard degree of the target voice data according to the pronunciation evaluation grade standard and the target similarity.
It can be understood that the functions of the functional modules of the speech data processing apparatus described in the embodiment of the present application can be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process of the method can refer to the relevant description of the foregoing method embodiment, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure. The computer device 910 may comprise a standalone device (e.g., one or more of a server, a node, a terminal, etc.) or may comprise a component (e.g., a chip, a software module, or a hardware module, etc.) within the standalone device. The computer device 910 may comprise at least one processor 911 and a communication interface 912, and further optionally, the computer device 910 may also comprise at least one memory 913 and a bus 914. The processor 911, communication interface 912, and memory 913 are coupled via bus 914, among other things.
The processor 911 is a module for performing arithmetic and/or logical operations, and may specifically be one or a combination of processing modules such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Microprocessor (MPU), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Complex Programmable Logic Device (CPLD), a coprocessor (assisting the central processing unit in completing corresponding processing and applications), and a Micro Control Unit (MCU).
Communication interface 912 may be used to provide information input or output to at least one processor. And/or, the communication interface 912 may be used for receiving and/or transmitting data from/to the outside, and may be a wired link interface such as an ethernet cable, and may also be a wireless link (Wi-Fi, bluetooth, general wireless transmission, vehicle-mounted short-range communication technology, other short-range wireless communication technology, and the like) interface. Communication interface 912 may serve as a network interface.
The memory 913 is used to provide a storage space in which data, such as an operating system and computer programs, may be stored. The memory 913 may be one or a combination of Random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or portable read-only memory (CD-ROM), among others.
The at least one processor 911 in the computer apparatus 910 is adapted to invoke computer programs stored in the at least one memory 913 for performing the speech data processing methods described in the embodiments shown in the present application.
In one possible implementation, the processor 911 in the computer device 910 is configured to invoke a computer program stored in the at least one memory 913 for performing the following operations: performing voice recognition processing on the target voice data to obtain a voice recognition result of the target voice data, wherein the voice recognition result comprises text information and pronunciation information; carrying out posterior error correction processing on the text information to obtain a reference text corresponding to the target voice data; and determining the pronunciation standard degree of the target voice data according to the reference pronunciation information of the reference text and the pronunciation information of the target voice data.
In one embodiment, the text information comprises at least one text segment, and the at least one text segment is obtained by text sentence breaking on the text information; the processor 911 is specifically configured to: performing semantic recognition processing on each text segment included in the text information of the target voice data to obtain the semantic integrity of each text segment; determining at least one reference text segment of the target voice data according to the semantic integrity of each text segment; and generating a reference text corresponding to the target voice data according to the at least one reference text segment.
In one embodiment, the processor 911 is specifically configured to: aiming at a target text segment included in text information of target voice data, if the semantic integrity of the target text segment indicates that the semantic integrity of the target text segment is complete, determining the target text segment as a reference text segment, wherein the target text segment is any one of at least one text segment included in the text information; and if the semantic completeness of the target text segment indicates that the semantic of the target text segment is incomplete, adjusting the target text segment, and determining a reference text segment corresponding to the target text segment based on the adjusted target text segment.
In one embodiment, the processor 911 is specifically configured to: masking any original processing object in the target text fragment to obtain a processed target text fragment; calling a language representation model to perform prediction processing on the processed target text segment to obtain at least one candidate object at the mask position in the target text segment; and adjusting the target text segment according to the at least one candidate object.
In one embodiment, the processor 911 is specifically configured to: obtaining a prediction probability corresponding to each candidate object, wherein the prediction probability is used for reflecting the possibility that the content at the mask position is the candidate object; taking a candidate object with the highest prediction probability in at least one candidate object as a first candidate object, and judging whether the first candidate object is an original processing object at a mask position; if the first candidate object is not the original processing object at the mask position, calculating the similarity between the pronunciation information of the first candidate object and the pronunciation information at the mask position; if the similarity is smaller than a preset similarity threshold, taking a second candidate object in the at least one candidate object as a first candidate object, and executing a step of judging whether the first candidate object is an original processing object at a mask position until the first candidate object is a candidate object with the minimum prediction probability in the at least one candidate object, wherein the second candidate object is a candidate object with the maximum prediction probability in the candidate objects with the prediction probability smaller than that of the first candidate object; and if the similarity is greater than or equal to a preset similarity threshold, replacing the original processing object at the mask position by using the first candidate object.
In one embodiment, the pronunciation information is pinyin information, and the pinyin information comprises initials, finals and tones; the processor 911 is specifically configured to: obtaining confusion pairs in a dialect library; determining initial similarity according to the confusion pair, the initial in the pronunciation information of the candidate object and the initial in the pronunciation information at the mask position, and determining final similarity according to the confusion pair, the final in the pronunciation information of the candidate object and the final in the pronunciation information at the mask position; determining a tone similarity between a tone in the pronunciation information of the candidate object and a tone in the pronunciation information at the mask position; and carrying out weighted summation on the initial consonant similarity, the final similarity and the tone similarity to obtain the similarity between the pronunciation information of the first candidate object and the pronunciation information at the mask position.
In one embodiment, the processor 911 is specifically configured to: determining a target similarity between reference pronunciation information of the reference text and pronunciation information of the target voice data; and determining the pronunciation standard degree of the target voice data according to the pronunciation evaluation grade standard and the target similarity.
It should be understood that the computer device 910 described in this embodiment may perform the description of the voice data processing method in the corresponding embodiment, and may also perform the description of the voice data processing apparatus 800 in the corresponding embodiment of fig. 7, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
In addition, it should be further noted that an exemplary embodiment of the present application further provides a storage medium, where the storage medium stores a computer program of the foregoing voice data processing method, where the computer program includes program instructions, and when one or more processors load and execute the program instructions, the description of the voice data processing method in the embodiment may be implemented, which is not described herein again, and beneficial effects of using the same method are also described herein without being described again. It will be understood that the program instructions may be deployed to be executed on one computer device or on multiple computer devices that are capable of communicating with each other.
The computer-readable storage medium may be the voice data processing apparatus provided in any of the foregoing embodiments or an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Memory Card (SMC), a Secure Digital (SD) card, a flash memory card (flash card), and the like provided on the computer device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the computer device. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
In one aspect of the application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided by one aspect of the embodiments of the present application.
In one aspect of the present application, another computer program product is provided, which includes a computer program or computer instructions, and when the computer program or the computer instructions are executed by a processor, the steps of the voice data processing method provided by the embodiment of the present application are realized.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs.
The modules in the device can be merged, divided and deleted according to actual needs.
While the invention has been described with reference to a number of embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method of processing speech data, the method comprising:
carrying out voice recognition processing on target voice data to obtain a voice recognition result of the target voice data, wherein the voice recognition result comprises text information and pronunciation information;
carrying out posterior error correction processing on the text information to obtain a reference text corresponding to the target voice data;
and determining the pronunciation standard degree of the target voice data according to the reference pronunciation information of the reference text and the pronunciation information of the target voice data.
2. The method of claim 1, wherein the textual information includes at least one text segment, the at least one text segment being derived based on text-breaking the textual information;
the performing posterior error correction processing on the text information to obtain a reference text corresponding to the target voice data includes:
performing semantic recognition processing on each text segment included in the text information of the target voice data to obtain semantic integrity of each text segment;
determining at least one reference text segment of the target voice data according to the semantic integrity of each text segment;
and generating a reference text corresponding to the target voice data according to the at least one reference text segment.
3. The method of claim 2, wherein the determining at least one reference text segment of the target voice data according to the semantic integrity of each text segment comprises:
for a target text segment included in the text information of the target voice data, if the semantic integrity of the target text segment indicates that the semantics of the target text segment are complete, determining the target text segment as a reference text segment, wherein the target text segment is any one of the at least one text segment included in the text information;
and if the semantic integrity of the target text segment indicates that the semantics of the target text segment are incomplete, adjusting the target text segment, and determining a reference text segment corresponding to the target text segment based on the adjusted target text segment.
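Purely as an illustration, claims 2 and 3 amount to the following selection loop; the semantic-integrity test shown is a placeholder heuristic, as the patent does not fix a particular semantic recognition model.

```python
# Sketch of claims 2-3: keep segments judged semantically complete, adjust
# the rest. is_semantically_complete() is a hypothetical stand-in.

def is_semantically_complete(segment):
    """Placeholder heuristic: treat very short fragments as incomplete."""
    return len(segment) >= 4  # illustrative threshold only

def build_reference_text(segments, adjust):
    reference_segments = []
    for seg in segments:
        if is_semantically_complete(seg):
            reference_segments.append(seg)          # claim 3, complete branch
        else:
            reference_segments.append(adjust(seg))  # claim 3, adjust branch
    return "".join(reference_segments)              # claim 2, final assembly
```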
4. The method of claim 3, wherein the adjusting the target text segment comprises:
masking an original processing object in the target text segment to obtain a processed target text segment;
calling a language representation model to perform prediction processing on the processed target text segment to obtain at least one candidate object at a mask position in the target text segment;
and adjusting the target text segment according to the at least one candidate object.
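By way of illustration only (not part of the claims), the mask-and-predict step of claim 4 can be sketched with an off-the-shelf masked language model. The use of the Hugging Face fill-mask pipeline and of bert-base-chinese is an assumption; the claim only requires some language representation model, and the masked position is assumed to be known.

```python
from transformers import pipeline

# Illustrative sketch of claim 4. bert-base-chinese is an assumed example
# model; any masked language model would fit the claim's wording.
fill_mask = pipeline("fill-mask", model="bert-base-chinese")

segment = "今天天汽很好"   # illustrative segment; "汽" is the suspect character
position = 3              # index of the original processing object (assumed known)

# Mask the original processing object to obtain the processed segment.
masked = segment[:position] + fill_mask.tokenizer.mask_token + segment[position + 1:]

# The model returns candidate objects for the mask position, each with a
# prediction probability ("score"), which claim 5 then iterates over.
for cand in fill_mask(masked, top_k=5):
    print(cand["token_str"], round(cand["score"], 4))
```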
5. The method of claim 4, wherein the adjusting the target text segment according to the at least one candidate object comprises:
obtaining a prediction probability corresponding to each of the at least one candidate object, wherein the prediction probability reflects the likelihood that the content at the mask position is the candidate object;
taking the candidate object with the highest prediction probability among the at least one candidate object as a first candidate object, and judging whether the first candidate object is the original processing object at the mask position;
if the first candidate object is not the original processing object at the mask position, calculating the similarity between the pronunciation information of the first candidate object and the pronunciation information at the mask position;
if the similarity is smaller than a preset similarity threshold, taking a second candidate object among the at least one candidate object as the first candidate object and returning to the step of judging whether the first candidate object is the original processing object at the mask position, until the first candidate object is the candidate object with the smallest prediction probability among the at least one candidate object, wherein the second candidate object is the candidate object with the largest prediction probability among the candidate objects whose prediction probabilities are smaller than that of the first candidate object;
and if the similarity is greater than or equal to the preset similarity threshold, replacing the original processing object at the mask position with the first candidate object.
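A minimal sketch of the candidate walk in claim 5 follows; the similarity function is whatever claim 6 computes, and the 0.8 threshold is an illustrative assumption rather than a value taken from the patent.

```python
# Sketch of the claim-5 candidate walk over fill-mask results.

def choose_replacement(candidates, original, similarity, threshold=0.8):
    # Walk candidates from highest to lowest prediction probability.
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        token = cand["token_str"]
        if token == original:
            return original            # model confirms the original character
        if similarity(token, original) >= threshold:
            return token               # pronunciation-compatible replacement
        # otherwise fall through to the next-most-probable candidate
    return original                    # keep the original if nothing qualifies
```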
6. The method of claim 5, wherein the pronunciation information is pinyin information, the pinyin information including initials, finals, and tones;
the calculating the similarity between the pronunciation information of the first candidate object and the pronunciation information at the mask position includes:
obtaining confusion pairs from a dialect library;
determining an initial similarity according to the confusion pairs, the initial in the pronunciation information of the first candidate object, and the initial in the pronunciation information at the mask position, and determining a final similarity according to the confusion pairs, the final in the pronunciation information of the first candidate object, and the final in the pronunciation information at the mask position;
determining a tone similarity between the tone in the pronunciation information of the first candidate object and the tone in the pronunciation information at the mask position;
and performing weighted summation on the initial similarity, the final similarity, and the tone similarity to obtain the similarity between the pronunciation information of the first candidate object and the pronunciation information at the mask position.
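The weighted pinyin similarity of claim 6 might look like the following sketch. The confusion-pair table, the partial-credit score of 0.5, and the 0.4/0.4/0.2 weights are all illustrative assumptions; the patent obtains its confusion pairs from a dialect library it does not enumerate.

```python
from pypinyin import pinyin, Style

# Illustrative confusion pairs (dialect-driven initial/final confusions).
CONFUSION = {("n", "l"), ("l", "n"), ("zh", "z"), ("z", "zh"),
             ("sh", "s"), ("s", "sh"), ("ch", "c"), ("c", "ch"),
             ("in", "ing"), ("ing", "in"), ("en", "eng"), ("eng", "en")}

def parts(ch):
    """Decompose one character's pinyin into (initial, final, tone)."""
    ini = pinyin(ch, style=Style.INITIALS, strict=False)[0][0]
    fin = pinyin(ch, style=Style.FINALS, strict=False)[0][0]
    tone3 = pinyin(ch, style=Style.TONE3, strict=False)[0][0]
    tone = tone3[-1] if tone3 and tone3[-1].isdigit() else "0"
    return ini, fin, tone

def part_sim(a, b):
    if a == b:
        return 1.0
    return 0.5 if (a, b) in CONFUSION else 0.0  # partial credit for confusables

def pinyin_similarity(c1, c2, w=(0.4, 0.4, 0.2)):
    (i1, f1, t1), (i2, f2, t2) = parts(c1), parts(c2)
    return (w[0] * part_sim(i1, i2)          # initial similarity
            + w[1] * part_sim(f1, f2)        # final similarity
            + w[2] * (1.0 if t1 == t2 else 0.0))  # tone similarity

print(pinyin_similarity("南", "兰"))  # n/l confusion -> partial credit (0.8)
```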
7. The method according to any one of claims 1 to 6, wherein the determining the pronunciation standard degree of the target speech data according to the reference pronunciation information of the reference text and the pronunciation information of the target speech data comprises:
determining a target similarity between reference pronunciation information of the reference text and pronunciation information of the target voice data;
and determining the pronunciation standard degree of the target voice data according to a pronunciation evaluation grading standard and the target similarity.
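Claim 7's final mapping from target similarity to a pronunciation standard degree could be as simple as the following sketch; the grade bands are invented for illustration, since the patent leaves the pronunciation evaluation grading standard unspecified.

```python
# Illustrative grading bands only; not values taken from the patent.

def pronunciation_grade(target_similarity):
    if target_similarity >= 0.90:
        return "excellent"
    if target_similarity >= 0.75:
        return "good"
    if target_similarity >= 0.60:
        return "fair"
    return "needs improvement"

print(pronunciation_grade(0.82))  # -> "good"
```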
8. A voice data processing apparatus, comprising a processing module and a determining module, wherein:
the processing module is used for performing voice recognition processing on target voice data to obtain a voice recognition result of the target voice data, wherein the voice recognition result comprises text information and pronunciation information;
the processing module is further used for performing posterior error correction processing on the text information to obtain a reference text corresponding to the target voice data;
and the determining module is used for determining the pronunciation standard degree of the target voice data according to the reference pronunciation information of the reference text and the pronunciation information of the target voice data.
9. A computer device, comprising: a processor, a memory, and a network interface;
the processor is connected with the memory and the network interface, wherein the network interface is used for providing a network communication function, the memory is used for storing program codes, and the processor is used for calling the program codes to execute the voice data processing method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the voice data processing method of any one of claims 1 to 7.
CN202210705099.XA 2022-06-21 2022-06-21 Voice data processing method and device, computer equipment and storage medium Pending CN115132174A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210705099.XA CN115132174A (en) 2022-06-21 2022-06-21 Voice data processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210705099.XA CN115132174A (en) 2022-06-21 2022-06-21 Voice data processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115132174A true CN115132174A (en) 2022-09-30

Family

ID=83380716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210705099.XA Pending CN115132174A (en) 2022-06-21 2022-06-21 Voice data processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115132174A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115910035A (en) * 2023-03-01 2023-04-04 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN117116267A (en) * 2023-10-24 2023-11-24 科大讯飞股份有限公司 Speech recognition method and device, electronic equipment and storage medium
CN117116267B (en) * 2023-10-24 2024-02-13 科大讯飞股份有限公司 Speech recognition method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US10573296B1 (en) Reconciliation between simulator and speech recognition output using sequence-to-sequence mapping
AU2019347734B2 (en) Conversational agent pipeline trained on synthetic data
CN108510976B (en) Multi-language mixed voice recognition method
US9058811B2 (en) Speech synthesis with fuzzy heteronym prediction using decision trees
EP2387031B1 (en) Methods and systems for grammar fitness evaluation as speech recognition error predictor
CN108766415B (en) Voice evaluation method
CN106782603B (en) Intelligent voice evaluation method and system
CN115132174A (en) Voice data processing method and device, computer equipment and storage medium
CN110797010A (en) Question-answer scoring method, device, equipment and storage medium based on artificial intelligence
CN112397056B (en) Voice evaluation method and computer storage medium
US20180277145A1 (en) Information processing apparatus for executing emotion recognition
CN110782880B (en) Training method and device for prosody generation model
CN111179916A (en) Re-scoring model training method, voice recognition method and related device
CN110600002A (en) Voice synthesis method and device and electronic equipment
US20080126094A1 (en) Data Modelling of Class Independent Recognition Models
KR20190059185A (en) Method and system for improving the accuracy of speech recognition technology based on text data analysis for deaf students
CN112669810A (en) Speech synthesis effect evaluation method and device, computer equipment and storage medium
CN116597809A (en) Multi-tone word disambiguation method, device, electronic equipment and readable storage medium
CN111128181B (en) Recitation question evaluating method, recitation question evaluating device and recitation question evaluating equipment
CN114360537A (en) Spoken question and answer scoring method, spoken question and answer training method, computer equipment and storage medium
CN114420159A (en) Audio evaluation method and device and non-transient storage medium
US20220230630A1 (en) Model learning apparatus, method and program
KR20200072005A (en) Method for correcting speech recognized sentence
CN114420086B (en) Speech synthesis method and device
CN113555006B (en) Voice information identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 2601, 2602, 2603, 2606, Zhongzhou building, No. 3088, Jintian Road, Gangxia community, Futian street, Futian District, Shenzhen, Guangdong 518000

Applicant after: Shenzhen Xiaoyudian Digital Technology Co.,Ltd.

Address before: 2601, 2602, 2603, 2606, Zhongzhou building, No. 3088, Jintian Road, Gangxia community, Futian street, Futian District, Shenzhen, Guangdong 518000

Applicant before: Shenzhen Huace Huihong Technology Co.,Ltd.