CN115359799A - Speech recognition method, training method, device, electronic equipment and storage medium


Info

Publication number: CN115359799A
Application number: CN202210995296.XA
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: editing, semantic, text, parameter, performance evaluation
Inventor: 杜秋蓉
Applicant and current assignee: Beijing Zitiao Network Technology Co Ltd
Legal status: Pending


Classifications

    • G10L15/26 — Speech recognition; speech-to-text systems
    • G10L15/01 — Assessment or evaluation of speech recognition systems
    • G10L15/063 — Creation of reference templates; training of speech recognition systems
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G10L15/1822 — Natural language modelling; parsing for meaning understanding
    • G10L15/30 — Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L25/63 — Speech or voice analysis for comparison or discrimination, for estimating an emotional state
    • G10L2015/0635 — Training: updating or merging of old and new templates; mean values; weighting
    • G06N3/084 — Backpropagation, e.g. using gradient descent


Abstract

The present disclosure provides a speech recognition method, a training method, an apparatus, an electronic device, and a storage medium. The method includes acquiring a target speech and determining a recognition text of the target speech based on a speech recognition model, where the model is trained as follows: a predicted text of a speech sample is predicted using the model, and the model parameters are updated when the performance evaluation index of the model meets an iteration condition. The parameters of the performance evaluation index include a recognition quantization parameter and a semantic difference parameter for each statistical object contained in the predicted text: when a statistical object is recognized correctly, the recognition quantization parameter includes correct-recognition quantization data; when a statistical object is recognized incorrectly, the recognition quantization parameter includes edit quantization data of the editing operation, and the semantic difference parameter is used to correct the edit quantization data. The performance evaluation index can thus differentially measure the influence of the predicted text on the user's understanding, preventing recognized text with a large semantic deviation from being mistakenly output as correct text.

Description

Speech recognition method, training method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a speech recognition method, a training method, an apparatus, an electronic device, and a storage medium.
Background
In Automatic Speech Recognition (ASR) technology, performance indexes such as the Character Error Rate (CER) and the Word Error Rate (WER) are mainly used to evaluate the performance of an ASR algorithm.
When a speech recognition model is trained, the predicted text of a speech sample can be recognized by the speech recognition model; the predicted text is then compared with a reference text to obtain evaluation index parameters, and the performance evaluation index of the ASR algorithm is determined from those parameters.
Disclosure of Invention
According to an aspect of the present disclosure, there is provided a speech recognition method including:
acquiring a target speech, and determining a recognition text of the target speech based on a speech recognition model, wherein the speech recognition model is trained in the following way:
predicting a predicted text of a speech sample by using the speech recognition model, and updating model parameters of the speech recognition model in response to a performance evaluation index of the speech recognition model meeting an iteration condition, wherein the performance evaluation index is used to measure the error rate of the predicted text of the speech sample relative to a standard text; parameters of the performance evaluation index include a recognition quantization parameter and a semantic difference parameter of a statistical object contained in the predicted text; when the statistical object is recognized correctly, the recognition quantization parameter includes correct-recognition quantization data; when the statistical object is recognized incorrectly, the recognition quantization parameter includes edit quantization data of an editing operation; and the semantic difference parameter is used to correct the edit quantization data.
According to another aspect of the present disclosure, there is provided a training method including:
predicting a predicted text of the speech sample by using the speech recognition model;
updating model parameters of the speech recognition model in response to a performance evaluation index of the speech recognition model meeting an iteration condition;
the performance evaluation index is used for measuring the error rate of a predicted text of a voice sample relative to a standard text, the parameters of the performance evaluation index comprise an identification quantization parameter and a semantic difference parameter of a statistical object contained in the predicted text, when the statistical object is identified correctly, the identification quantization parameter comprises identification correct quantization data, when the statistical object is identified incorrectly, the identification quantization parameter comprises editing quantization data of editing operation, and the semantic difference parameter is used for correcting the editing quantization data.
According to another aspect of the present disclosure, there is provided a speech recognition apparatus including:
an acquisition module, configured to acquire a target speech;
a recognition module, configured to determine a recognition text of the target speech based on a speech recognition model, the speech recognition model being trained in the following way:
predicting a predicted text of a speech sample by using the speech recognition model, and updating model parameters of the speech recognition model in response to a performance evaluation index of the speech recognition model meeting an iteration condition, wherein the performance evaluation index is used to measure the error rate of the predicted text of the speech sample relative to a standard text; parameters of the performance evaluation index include a recognition quantization parameter and a semantic difference parameter of a statistical object contained in the predicted text; when the statistical object is recognized correctly, the recognition quantization parameter includes correct-recognition quantization data; when the statistical object is recognized incorrectly, the recognition quantization parameter includes edit quantization data of an editing operation; and the semantic difference parameter is used to correct the edit quantization data.
According to another aspect of the present disclosure, there is provided a training apparatus comprising:
a prediction module, configured to predict the predicted text of a speech sample by using the speech recognition model;
an updating module, configured to update model parameters of the speech recognition model in response to a performance evaluation index of the speech recognition model meeting an iteration condition, wherein the performance evaluation index is used to measure the error rate of the predicted text of the speech sample relative to a standard text; parameters of the performance evaluation index include a recognition quantization parameter and a semantic difference parameter of a statistical object contained in the predicted text; when the statistical object is recognized correctly, the recognition quantization parameter includes correct-recognition quantization data; when the statistical object is recognized incorrectly, the recognition quantization parameter includes edit quantization data of an editing operation; and the semantic difference parameter is used to correct the edit quantization data.
According to another aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory storing a program;
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the method according to an exemplary embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform a method according to exemplary embodiments of the present disclosure.
In one or more embodiments provided by the exemplary embodiments of the present disclosure, when the speech recognition model is in the training stage and a statistical object is recognized incorrectly, the recognition quantization parameter includes edit quantization data of the editing operation, and the semantic difference parameter is used to correct that edit quantization data. Therefore, when the performance evaluation index is determined from the edit quantization data and the correct-recognition quantization data, the edit quantization data can be corrected with the semantic difference data, so that the corrected edit quantization data reflects not only the objective difference between the predicted text and the standard text but also the subjective difference between them from the perspective of semantic understanding. After the performance evaluation index is determined from the corrected edit quantization data and the correct-recognition quantization data, and the error rate of the predicted text relative to the standard text is evaluated with this index, the index can distinguish the influence of semantic deviation on the recognition result. The performance evaluation index used by the method of the exemplary embodiments of the present disclosure can therefore differentially measure the influence of the predicted text on the user's understanding, and recognized text with a large semantic deviation is prevented from being mistakenly output as correct text.
Drawings
Further details, features and advantages of the disclosure are disclosed in the following description of exemplary embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 shows a schematic diagram of an example system for the methods described in the exemplary embodiments of the present disclosure;
FIG. 2 illustrates an example flow chart of a training method of an example embodiment of the present disclosure;
FIG. 3 illustrates an example flow chart of a speech recognition method of an example embodiment of the present disclosure;
FIG. 4 is a diagram illustrating the effect of emotion offset and semantic offset on the word error rate according to an exemplary embodiment of the present disclosure;
FIG. 5 shows a schematic block diagram of functional modules of a speech recognition apparatus of an exemplary embodiment of the present disclosure;
FIG. 6 shows a schematic block diagram of functional modules of a training apparatus according to an exemplary embodiment of the present disclosure;
FIG. 7 shows a schematic block diagram of a chip according to an example embodiment of the present disclosure;
FIG. 8 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more complete and thorough understanding of the present disclosure. It should be understood that the drawings and the embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description. It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Before describing the embodiments of the present disclosure, the related terms referred to in the embodiments of the present disclosure are first explained as follows:
Speech recognition technology, also known as Automatic Speech Recognition (ASR), aims to convert the vocabulary content of human speech into computer-readable input.
Word segmentation refers to the process of recombining a continuous character sequence into a sequence of words according to a certain standard; it includes English word segmentation and Chinese word segmentation. In English, spaces serve as natural delimiters between words, whereas a written Chinese sentence has no delimiters. Cutting a Chinese character sequence into meaningful words is Chinese word segmentation. In the exemplary embodiments of the present disclosure, each unit in the segmentation result of a text is referred to as a word segment.
The edit distance was proposed by the Soviet mathematician Vladimir Levenshtein in 1965. It describes the difference between two character strings by the minimum number of edits required to transform one into the other, where the editing operations include substitution, deletion, and insertion. It is now widely used in fields such as word error rate calculation, deoxyribonucleic acid (DNA) sequence alignment, and spelling detection.
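As a supplement to this definition, the following is a minimal Python sketch of the edit-distance dynamic program over two token sequences (characters or word segments); it is illustrative and not part of the original disclosure:

```python
def edit_distance(hyp, ref):
    """Minimum number of substitutions, deletions, and insertions
    needed to turn hyp into ref, computed by dynamic programming."""
    m, n = len(hyp), len(ref)
    # dp[i][j]: cost of editing hyp[:i] into ref[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all remaining hyp tokens
    for j in range(n + 1):
        dp[0][j] = j          # insert all remaining ref tokens
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

print(edit_distance(list("kitten"), list("sitting")))  # 3
```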
The Character Error Rate (CER) evaluates the error rate between a predicted text and a standard text at the level of individual characters. The Word Error Rate (WER), an important index for evaluating ASR performance, evaluates the error rate between a predicted text and a standard text at the level of words.
Part-of-speech tagging, also simply called POS tagging, is the procedure of labeling each word in a segmentation result with its correct part of speech, i.e., determining whether each word is a noun, verb, adjective, or another part of speech. This process is performed by the algorithm of a part-of-speech tagger.
Text Sentiment Analysis, also known as opinion mining, refers to the process of analyzing, processing, and extracting subjective, emotionally colored text using natural language processing and text mining techniques.
The back propagation algorithm optimizes the network parameters of a neural network using gradient descent: the value of a loss function is computed from the network's output and the expected value, the partial derivatives of the loss function with respect to the model parameters are computed, and the network parameters are then updated.
The model parameters include weight parameters, which represent the slope of the hyperplane, and bias parameters, which represent its intercept.
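For illustration, the toy numpy sketch below trains a single linear unit by gradient descent, making the roles of the weight (slope) and bias (intercept) parameters and of the back-propagated partial derivatives concrete; the data and learning rate are hypothetical:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0                  # target: weight 2, bias 1
w, b, lr = 0.0, 0.0, 0.1

for _ in range(200):
    pred = w * x + b
    loss = np.mean((pred - y) ** 2)      # value of the loss function
    dw = np.mean(2 * (pred - y) * x)     # partial derivative w.r.t. the weight
    db = np.mean(2 * (pred - y))         # partial derivative w.r.t. the bias
    w -= lr * dw                         # update the network parameters
    b -= lr * db

print(round(w, 3), round(b, 3))  # converges to roughly 2.0 and 1.0
```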
The exemplary embodiments of the present disclosure provide a speech recognition method and a training method. The performance evaluation index used by the speech recognition model in the training stage can not only measure the recognition error rate of the predicted text of a speech sample relative to a standard text, but also reflect, in a differentiated manner, the influence of semantic interference on that error rate. This ensures that the influence of the predicted text on the user's understanding is measured differentially, and prevents recognized text with a large semantic deviation from being mistakenly output as correct text.
Fig. 1 shows a schematic diagram of a system architecture for the methods provided by the exemplary embodiments of the present disclosure. As shown in fig. 1, the system architecture 100 includes: a user device 110, an execution device 120, and a data storage system 130.
As shown in fig. 1, the user device 110 may communicate with the execution device 120 through a communication network. The communication network may be wired or wireless: the wired communication network may, for example, be based on power line carrier technology, while the wireless communication network may be a local area wireless network or a wide area wireless network. The local area wireless network can be a WIFI or Zigbee network, and the wide area wireless network can be a mobile communication network, a satellite communication network, or the like.
As shown in fig. 1, the user device 110 may be an intelligent terminal such as a computer, a mobile phone, or an information processing center, and may serve as the initiator of speech recognition by sending a request to the execution device 120. The execution device 120 may be a server with data processing capability, such as a cloud server, a web server, an application server, or a management server, that implements the speech recognition method. The server may be configured with a deep learning processor, which may be a single-core Deep Learning Processor (DLP-S) or a multi-core Deep Learning Processor (DLP-M). The DLP-M is a multi-core extension of the DLP-S: multiple DLP-S cores communicate through a Network-on-Chip (NoC) using protocols such as interconnection, multicast, and inter-core synchronization.
As shown in fig. 1, the data storage system 130 is a general term that includes databases for locally storing historical data, which may reside on the execution device 120 or on other network servers. The data storage system 130 may be separate from the execution device 120 or integrated within it. The data storage system 130 may store not only data uploaded by the user device 110 but also program instructions, neuron data, and the like, which may be trained data. In addition, the data storage system 130 may also store processing results obtained by the execution device 120 (such as the preprocessed target speech to be recognized, intermediate processing results, or the recognized text).
In practical applications, as shown in fig. 1, the user device 110 may have a voice collection function, so that it can not only initiate a request to the execution device 120 through the interactive interface but also collect the target speech to be recognized and send it to the execution device 120 through the communication network. On this basis, when the execution device 120 implements the speech recognition method, the target speech to be recognized can be acquired either from the data storage system 130 or from the user device 110 through the communication network. In addition, when the execution device 120 implements the speech recognition method, the speech recognition result may not only be fed back to the user device 110 through the communication network but also be stored in the data storage system 130.
In the related art, speech recognition technology has been widely applied to various voice interaction scenarios. Speech recognition algorithms often use performance evaluation indexes such as the character error rate and the word error rate to measure the performance of a speech recognition model; the smaller these indexes are, the better the model's performance.
Considering the differences between languages, the same performance evaluation index may behave differently when evaluating speech recognition models for different languages. For example, for English the character error rate and the word error rate coincide: both take the word as the unit and evaluate the error rate of the predicted text relative to the standard text. The Chinese word error rate, by contrast, evaluates the error rate of the predicted text relative to the standard text with the Chinese word segment as the unit.
In practical applications, the character error rate and the word error rate share the same calculation formula: both use edit distance statistics to obtain the edit quantization parameter between the predicted text and the standard text, i.e., the number of editing operations. For example, the number of editing operations may include the number of substitution operations, the number of deletion operations, and the number of insertion operations; the performance evaluation index can then be determined using formula one.
The performance evaluation index can be the character error rate or the word error rate. In a Chinese speech recognition scenario, when the performance evaluation index is the character error rate, a single Chinese character is taken as the statistical object (a punctuation mark can be regarded as a statistical object at the character level) and the edit quantization parameter between the predicted text and the standard text is counted; when the performance evaluation index is the word error rate, the edit quantization parameter of the editing operations of the predicted text relative to the standard text is counted with the Chinese word segment as the statistical object (a punctuation mark can be regarded as a statistical object at the segment level).
$$\mathrm{WER} = \frac{S + D + I}{S + D + C} \times 100\% \qquad \text{(formula one)}$$
In the exemplary embodiments of the present disclosure, WER denotes the performance evaluation index generically, without distinguishing whether the minimum unit of the statistical object is the character or the word. S represents the minimum number of substitutions that occur when the predicted text is transcribed into the standard text (that is, the number of substitution operations), i.e., the minimum number of statistical objects that must be replaced when the predicted text is edited into the standard text. D represents the minimum number of deletions (the number of deletion operations), i.e., the minimum number of statistical objects that must be deleted when the predicted text is edited into the standard text. I represents the minimum number of insertions (the number of insertion operations), i.e., the minimum number of statistical objects that must be inserted when the predicted text is edited into the standard text. C represents the number of correctly recognized units in the predicted text, i.e., the maximum number of statistical objects in the predicted text that need not change.
When the statistical object is the word, S may represent the minimum number of words that must be replaced when the predicted text is edited into the standard text, D the minimum number of words that must be deleted, I the minimum number of words that must be inserted, and C the maximum number of words in the predicted text that need not change.
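The following sketch makes formula one concrete: it backtraces the edit-distance table to count S, D, I, and C for two token sequences and then evaluates the ratio. The example tokens are hypothetical:

```python
def align_counts(hyp, ref):
    """Count substitutions, deletions, insertions, and correct tokens
    along one optimal edit path from hyp to ref."""
    m, n = len(hyp), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1): dp[i][0] = i
    for j in range(n + 1): dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i-1] == ref[j-1] else 1
            dp[i][j] = min(dp[i-1][j] + 1, dp[i][j-1] + 1, dp[i-1][j-1] + cost)
    S = D = I = C = 0
    i, j = m, n
    while i > 0 or j > 0:   # backtrace one optimal path
        if i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] and hyp[i-1] == ref[j-1]:
            C += 1; i -= 1; j -= 1        # correct token
        elif i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + 1:
            S += 1; i -= 1; j -= 1        # substitution
        elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
            D += 1; i -= 1                # deletion from hyp
        else:
            I += 1; j -= 1                # insertion of a ref token
    return S, D, I, C

S, D, I, C = align_counts(list("abcd"), list("axcde"))
wer = (S + D + I) / (S + D + C)   # formula one
print(S, D, I, C, wer)            # 1 0 1 3 0.5
```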
As formula one shows, when it is used to evaluate performance indexes such as the character error rate and the word error rate, the index reflects the recognition error rate of the statistical objects contained in the predicted text. However, in some cases such an index struggles to evaluate, truly and fairly, the error rate of the predicted text of a speech sample relative to the standard text.
When statistical objects contained in the predicted text are recognized incorrectly, some recognition errors are negligible, while other incorrectly recognized statistical objects may bias the user's subjective understanding. Such a bias may cause the user to misunderstand the meaning of the predicted text, or even trigger unnecessary negative emotions, thereby degrading the user's comprehension experience.
The performance evaluation index adopted in the training stage of the speech recognition model involved in the speech recognition method provided by the exemplary embodiments of the present disclosure is an improved one: a semantic difference parameter is introduced to follow the user's subjective judgment, so that the index fully reflects the performance impact that semantic differences have on the speech recognition model. This ensures that the trained model measures differentially the influence of the predicted text on the user's understanding and prevents recognized text with a large semantic deviation from being mistakenly output as correct text.
The speech recognition model involved in the speech recognition method provided by the exemplary embodiments of the present disclosure may be trained by the server in the execution device. Based on this, the exemplary embodiments of the present disclosure also provide a training method, which may be performed by a server or by a chip in the server. For ease of understanding, the training method is described below in terms of a user device interacting with a server.
Fig. 2 illustrates an example flow diagram of a training method of an example embodiment of the present disclosure. As shown in fig. 2, a method of an exemplary embodiment of the present disclosure may include:
step 201: the user equipment sends the voice sample to the server. The user equipment can be used as a voice sample collection end to collect voice samples, and upload the collected voice samples to a server included in the execution equipment through a communication network.
Step 202: the server predicts the predicted text of the speech sample using the speech recognition model. For example, the server reads the neuron data of the speech recognition model stored in the data storage system and, taking the speech sample as the input, finally obtains the predicted text of the speech sample.
Step 203: the server updates the model parameters of the speech recognition model in response to the performance evaluation index of the speech recognition model meeting the iteration condition. The performance evaluation index of the exemplary embodiments of the present disclosure may be used to measure the error rate of the predicted text of the speech sample relative to the standard text and may be obtained as follows:
the server carries out performance evaluation index parameter statistics based on the prediction text and the standard text of the voice sample to obtain parameters of the performance evaluation indexes, and the performance evaluation indexes are determined based on the parameters of the performance evaluation indexes.
The standard text of the exemplary embodiments of the present disclosure is the reference text of a speech sample and can be regarded as its ground-truth text. It may be a manually labeled standard text, or recognition text output by an already-trained speech recognition model and verified to match the input speech.
In practical applications, the model parameters of the speech recognition model are updated in response to the performance evaluation index satisfying the iteration condition. The iteration condition may be that the performance evaluation index is not smaller than a preset performance evaluation value: if the performance evaluation index is greater than or equal to the preset value, the iteration condition is satisfied, and a back propagation algorithm may be used to update the model parameters of the speech recognition model. The model parameters may be weights, may include bias values, or may include both weights and bias values.
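The control flow of this training loop can be sketched as follows; `evaluate_index`, `backprop_update`, and the preset threshold are hypothetical stubs standing in for the patent's unspecified implementations:

```python
PRESET_THRESHOLD = 0.05   # preset performance evaluation value (assumed)

def evaluate_index(model, samples):
    """Stub: would compute the performance evaluation index on the
    samples, e.g. the semantically corrected WER of formula two."""
    raise NotImplementedError

def backprop_update(model, samples):
    """Stub: would run one back-propagation step over the model's
    weights and biases."""
    raise NotImplementedError

def train(model, samples, max_steps=10_000):
    for _ in range(max_steps):
        index = evaluate_index(model, samples)
        # iteration condition: the index is still >= the preset value
        if index < PRESET_THRESHOLD:
            break                          # training has converged
        backprop_update(model, samples)    # update the model parameters
    return model
```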
For example, the parameters of the performance evaluation index of the exemplary embodiments of the present disclosure may include a recognition quantization parameter and a semantic difference parameter for each statistical object contained in the predicted text. The semantic difference parameter corrects the recognition quantization parameter, ensuring that the performance evaluation index determined from the corrected parameter measures differentially the influence of the predicted text on the user's understanding and preventing recognized text with a large semantic deviation from being mistakenly output as correct text.
For example: when the statistical object is recognized correctly, the recognition quantization parameter includes correct-recognition quantization data; when the statistical object is recognized incorrectly, the recognition quantization parameter includes edit quantization data of the editing operation, and the semantic difference parameter is used to correct the edit quantization data.
The speech recognition method of the exemplary embodiments of the present disclosure may be performed by an electronic device, which may be a server or a user device, or by a chip in the electronic device. When the electronic device is a server, the speech recognition model is deployed on the server side; when the electronic device is a user device, the model is deployed on the user device. The method of the present disclosure is described below with reference to the figures. It should be understood that the related contents of the training method and the speech recognition method of the exemplary embodiments of the present disclosure may refer to each other; for their specific processes, reference may be made to the related art, which is not repeated here.
Fig. 3 shows a flowchart of a speech recognition method of an exemplary embodiment of the present disclosure. As shown in fig. 3, a speech recognition method of an exemplary embodiment of the present disclosure may include:
step 301: and acquiring the target voice. The target voice can be collected through a voice collector of the user equipment, the user equipment can preprocess the collected target voice, and the preprocessing can include: and sampling, quantizing and coding the analog signal of the target voice at equal intervals, thereby obtaining a digital signal of the target voice. Meanwhile, before the analog signal of the target voice is converted into the digital signal, the target voice can be filtered, and unnecessary interference introduced to voice recognition is avoided. On the basis, the digital signal of the target voice can be processed by pre-emphasis processing, framing, windowing, end point detection and the like,
step 302: determining a recognition text of the target voice based on a voice recognition model, wherein the voice recognition model can be obtained by training in the following way: the method comprises the steps of predicting a predicted text of a voice sample by using a voice recognition model, responding to that a performance evaluation index of the voice recognition model meets an iteration condition, and updating model parameters of the voice recognition model, wherein the performance evaluation index can be used for measuring the error rate of the predicted text of the voice sample relative to a standard text, and the parameters of the performance evaluation index comprise a recognition quantization parameter and a semantic difference parameter of a statistical object contained in the predicted text. It should be understood that the speech recognition model can be various models that can implement speech recognition, such as a chinese language model (also called n-gram model), a recurrent neural network, etc., and the specific architecture can refer to the related art, which is not described in detail herein.
The predicted text of the exemplary embodiments of the present disclosure may contain one or more statistical objects; the number depends on the length of the predicted text. When there is one statistical object, the predicted text carries the recognition quantization parameter and semantic difference parameter of that single object; when there are several, it carries those of each of the statistical objects.
Taking a Chinese speech recognition scenario as an example, suppose the predicted text in the training stage is "Have you eaten today?". When the unit of the statistical object is the Chinese character, there are 7 statistical objects — the characters rendered here as "today", "day", "you", "eat", "done", and "do", plus the mark "?" — and each character or punctuation mark has its own recognition quantization parameter and semantic difference parameter. When the unit of the statistical object is the word segment, there are 5 statistical objects: "today", "you", "eaten", "do", and "?". Each segment or punctuation mark likewise has a recognition quantization parameter and a semantic difference parameter.
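As a language-neutral sketch of the two granularities, the snippet below counts character-level and segment-level statistical objects for an English stand-in sentence; the segmentation is supplied by hand and is purely illustrative:

```python
text = "Have you eaten today?"
char_objects = [c for c in text if not c.isspace()]    # character-level units
word_objects = ["Have", "you", "eaten", "today", "?"]  # segment-level units
print(len(char_objects), len(word_objects))            # 18 5
```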
In the training stage, each statistical object contained in the predicted text is recognized either correctly or incorrectly. When it is recognized correctly, the recognition quantization parameter may include correct-recognition quantization data; when it is recognized incorrectly, the recognition quantization parameter may include edit quantization data of the editing operation, and the semantic difference parameter may be used to correct that data.
For example, the correct-recognition quantization data of the exemplary embodiments of the present disclosure may be the count of correctly recognized units obtained during statistics over the statistical objects, i.e., the maximum number of statistical objects in the predicted text that need not change. For the edit quantization data, refer to the foregoing description of the edit quantization parameter, i.e., the number of editing operations.
On this basis, in determining the performance evaluation index from the edit quantization data and the correct-recognition quantization data, the edit quantization data can be corrected with the semantic difference data, so that the corrected data reflects both the objective difference between the predicted text and the standard text and the subjective difference between them from the perspective of semantic understanding. After the performance evaluation index is determined from the corrected edit quantization data and the correct-recognition quantization data, the index can distinguish the influence of semantic deviation on the recognition result when evaluating the error rate of the predicted text relative to the standard text. The method of the exemplary embodiments of the present disclosure can therefore differentially measure the influence of the predicted text on the user's understanding and prevent recognized text with a large semantic deviation from being mistakenly output as correct text.
In practical applications, the performance evaluation index of the exemplary embodiments of the present disclosure is determined by a first parameter and a second parameter; the index is positively correlated with the first parameter and negatively correlated with the second parameter. For example, the performance evaluation index may equal the ratio of the first parameter to the second parameter: if the first parameter is M and the second parameter is N, then the performance evaluation index = M/N, where M and N are both integers greater than or equal to 1.
Both the first and second parameters of the exemplary embodiments of the present disclosure may be positively correlated with the semantic difference parameter: the first parameter may be determined by the edit quantization data and the semantic difference parameter, and the second parameter by the edit quantization data, the correct-recognition quantization data, and the semantic difference parameter.
In one possible implementation, the semantic difference parameter of the exemplary embodiments of the present disclosure may correct the edit quantization data additively. In this case, the performance evaluation index introduces the semantic difference parameter, in additive form, on top of formula one: the edit quantization data is corrected by adding the semantic difference parameter, the first parameter is determined from the corrected edit quantization data, and the second parameter is determined from the corrected edit quantization data together with the correct-recognition quantization data. The performance evaluation index then satisfies formula two:
$$\mathrm{WER} = \frac{S + D + I + E_{\mathrm{total}}}{S + D + C + E_{\mathrm{total}}} \times 100\% \qquad \text{(formula two)}$$
where $E_{\mathrm{total}}$ represents the sum of the semantic difference parameters of the edit objects of the respective editing operations; it appears in formula two as a penalty term.
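A small sketch of formula two, with hypothetical counts and semantic difference parameters, shows how the penalty term raises the index as the semantic deviation of the edit objects grows:

```python
def semantic_wer(S, D, I, C, semantic_diffs):
    """Formula two: the semantic difference parameters of the edit
    objects are summed into E_total, which corrects both the numerator
    and the denominator."""
    E_total = sum(semantic_diffs)   # one parameter per edit object
    return (S + D + I + E_total) / (S + D + C + E_total)

# two substitutions with mild vs. strong semantic deviation (values assumed)
print(semantic_wer(2, 0, 0, 5, [0.1, 0.1]))  # ~0.31: mild deviation
print(semantic_wer(2, 0, 0, 5, [3.0, 3.0]))  # ~0.62: strong deviation
```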
In practical applications, considering that it is difficult to judge the semantics expressed by a single Chinese character, if the statistical object takes the character as the minimum unit, the edit objects belonging to the same word segment share one semantic difference parameter. If the statistical object takes the word segment as the minimum unit, the semantic difference parameters of edit objects belonging to different segments can be independent of each other.
For example, the edit object of the exemplary embodiments of the present disclosure may refer to the objects involved before and after an editing operation; on this basis, the edit object may include a pre-edit object and a post-edit object. The editing operations, namely substitution, deletion, and insertion, aim to transform the predicted text into the standard text. Therefore, the word segments to which an edit object belongs may include the segment it belongs to in the predicted text before editing and the segment it belongs to in the standard sample.
When the editing operation is a substitution, the edit object may include a pre-substitution object and a post-substitution object. The pre-substitution object may be the statistical object in the predicted text that needs to be replaced; the segment it belongs to is then the segment containing it in the segmentation of the predicted text. The post-substitution object may be the content that replaces that statistical object in the predicted text; the segment it belongs to is essentially the containing segment in the segmentation of the standard text.
For example, suppose the segmentation result of the predicted text hyp is ['I', 'say', 'experience', 'we', 'go', 'where', 'play', '?'], and the standard text is ref = "Where do I say we go to play this night?", whose segmentation result is ['I', 'say', 'this night', 'we', 'go', 'where', 'play', '?'].
If the statistical object takes the character as the minimum unit, comparing the predicted text hyp with the standard text ref shows that the edit objects include the pre-substitution characters rendered "pass" and "experience" and the post-substitution characters rendered "present" and "night": "pass" must be replaced by "present" and "experience" by "night". The predicted text hyp is thus edited into the standard text ref by two substitution operations. The word segment to which the pre-substitution characters belong is "experience" in both operations, and the segment to which the post-substitution characters belong is "this night" in both. Therefore, the characters "pass" and "experience" share the same semantic difference parameter.
If the statistical object takes the word segment as the minimum unit, comparing the segmentation results of hyp and ref shows that "experience" in hyp must be replaced by "this night" in ref to edit hyp into ref. The number of substitution operations is therefore 1; the edit object of the substitution includes the pre-substitution object "experience" and the post-substitution object "this night"; the segment to which the pre-substitution object belongs is "experience", and the segment to which the post-substitution object belongs is "this night".
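The sharing rule can be sketched as follows: with character-level statistical objects, edited characters falling in the same word segment contribute the segment's semantic difference parameter only once. The data structures and values below are illustrative:

```python
def total_penalty(edits, char_to_word, word_penalty):
    """edits: positions of edited characters; char_to_word: position ->
    containing word segment; word_penalty: segment -> semantic
    difference parameter. Segments are counted once, however many of
    their characters were edited."""
    touched_words = {char_to_word[pos] for pos in edits}  # dedupe per segment
    return sum(word_penalty[w] for w in touched_words)

# the two replaced characters both belong to the segment "experience"
char_to_word = {3: "experience", 4: "experience"}
print(total_penalty([3, 4], char_to_word, {"experience": 2.5}))  # 2.5, not 5.0
```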
As for the deletion and insertion operations: a deletion targets a statistical object contained in the predicted text that needs to be deleted, so its edit object may include a pre-deletion object, i.e., that statistical object. An insertion targets content contained in the standard sample that needs to be inserted into the predicted text, so its edit object may include a post-insertion object, i.e., that content.
For example, when the semantic difference parameter of the exemplary embodiments of the present disclosure corrects the edit quantization data, the correction may emphasize particular attributes of the deviation. Moreover, the edit quantization data may be corrected from a single angle or from multiple angles.
In practical applications, the influence of a recognition error can sometimes be ignored, but at other times it cannot, because it has serious negative effects on the user's understanding, such as semantic deviation, emotional deviation, and other differences in semantic understanding. Based on this, the semantic difference parameter of the exemplary embodiments of the present disclosure may introduce into the edit quantization parameter the influence of semantic offset from the perspective of syntactic analysis, the influence of emotional offset from the perspective of sentiment analysis, or both.
Taking the word error rate as an example, Table 1 shows the influence of emotion offset and semantic offset on the word error rate, and FIG. 4 illustrates the same influence graphically. It should be understood that WER in Table 1 represents the word error rate, and the more plus signs after WER, the higher the word error rate.
Table 1: influence of emotion offset and semantic offset on the word error rate

  [The original table is an image; its cell values are not recoverable here. Per the surrounding text, the WER grade rises with both the degree of emotion offset and the degree of semantic offset.]
In fig. 4, curve a shows the influence of semantic offset on the word error rate when there is no emotional offset, curve b when the emotion is moderately offset, and curve c when the emotion is extremely offset. As curves a, b, and c in Table 1 and fig. 4 show, for a phrase with a given degree of semantic offset, the greater the emotional offset, the higher the corresponding error rate; and for a phrase with a given degree of emotional offset, the greater the semantic offset, the higher the corresponding error rate.
In an alternative manner, when the influence of semantic offset is introduced into the edit quantization parameter from the perspective of syntactic analysis, the semantic difference parameter of the exemplary embodiments of the present disclosure may include the edit importance of the editing operation. The edit importance reflects the degree of misunderstanding that the semantic error causes in the user's comprehension.
The edit importance of an editing operation is positively correlated with the semantic definition level of the word segment to which the edit object of the operation belongs. That is, the stronger the semantic restriction of the segment to which an incorrectly recognized statistical object belongs, the higher the importance of that segment to the statistical object.
The semantic definition level of the exemplary embodiments of the present disclosure may be determined by the part of speech of the segment to which the incorrectly recognized statistical object belongs. On this basis, the predicted text can be segmented into word segments, whose number depends on the segmentation method and the length of the text. Technically, the segmentation method may be dictionary-based, statistics-based, rule-based, and so on; in terms of tools, it may be LAC segmentation, jieba segmentation, and the like. The segmentation tool can also label each segment with its part of speech, yielding the part of speech of every word segment.
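As an illustration of segmentation with part-of-speech labeling, the sketch below uses the jieba toolkit mentioned above (installable with `pip install jieba`; LAC would serve equally). The sentence is jieba's stock example, and the printed tags depend on the toolkit's dictionary:

```python
import jieba.posseg as pseg

# segment the sentence and tag each segment with its part of speech
for word, flag in pseg.cut("我爱北京天安门"):   # "I love Beijing Tiananmen"
    print(word, flag)    # e.g. 我 r / 爱 v / 北京 ns / 天安门 ns
```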
In one example, the relationship between the part of speech of a segment and the semantic definition level may be determined in a rule-based manner. For example, a first mapping relationship between semantic definition levels and segment parts of speech may be set; for an incorrectly recognized statistical object, the corresponding semantic definition level can be looked up in the first mapping relationship based on the part of speech of the segment it belongs to. It should be understood that one semantic definition level may correspond to one part of speech or to several.
For example, a 4-level gradient classification table may be used to set the mapping between semantic definition levels and segment parts of speech contained in the first mapping relationship. Table 2 shows such a 4-level gradient classification table; its part-of-speech labels follow the part-of-speech meanings of the LAC segmenter.
Table 2: 4-level gradient classification of semantic definition levels and segment parts of speech

  Semantic definition level   Segment category    Quantized value
  Level 1                     redundant words     0
  Level 2                     weak qualifiers     1
  Level 3                     strong qualifiers   2
  Level 4                     core words          3
Table 2 defines the correspondence between semantic definition levels and segment parts of speech: the semantic definition level is divided into four levels, each corresponding to one segment category. The lower the level (the smaller its index), the weaker the semantic restriction of the corresponding segment category.
For example: the segment category corresponding to the first semantic definition level is redundant words, which carry almost no semantic restriction, so the quantized value of the first level is 0. The category corresponding to the second level is weak qualifiers, which restrict the semantics slightly, so its quantized value is 1. The category corresponding to the third level is strong qualifiers, which restrict the semantics noticeably, so its quantized value is 2. The category corresponding to the fourth level is core words, which restrict the semantics especially strongly, so its quantized value is 3. The higher the semantic definition level, the higher the corresponding quantized value: the two are positively correlated. Since the edit importance of an editing operation is positively correlated with the semantic definition level of the segment to which its edit object belongs, the edit importance can be determined from the quantized value of the semantic definition level.
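A rule-based first mapping relationship can be as simple as a lookup table. The sketch below assigns the quantized values of Table 2 to the POS labels that appear in Table 3 (r, v, t, w); the exact tag-to-level assignment is an assumption consistent with Table 3, not an exhaustive rule set:

```python
LEVEL_QUANT = {
    "w": 0,  # level 1: redundant words (e.g. punctuation)
    "r": 1,  # level 2: weak qualifiers (e.g. pronouns)
    "v": 2,  # level 3: strong qualifiers (e.g. verbs)
    "t": 3,  # level 4: core words (e.g. time words, in this example)
}

def semantic_level(pos_tag):
    # default unseen tags to a weak qualifier (assumed fallback)
    return LEVEL_QUANT.get(pos_tag, 1)

print([semantic_level(t) for t in ["r", "v", "t", "w"]])  # [1, 2, 3, 0]
```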
In one example, the relationship between the part-of-speech of a participle and the semantic definition level may be determined by neural network model recognition. The neural network model may be a Transformer neural network model based on the self-attention mechanism, a Recurrent Neural Network (RNN), or a Long Short-Term Memory network (LSTM).
In the training stage, the participle semantics (that is, the word segmentation result), the participle part of speech, and the labeled quantized value of the semantic definition level may be used as inputs to the neural network model. The neural network model determines a predicted quantized value of the semantic definition level based on the participle semantics and part of speech; a loss function then measures the loss between the predicted quantized value and the labeled quantized value. If the loss is less than a preset loss, training of the neural network model ends; otherwise, the model parameters of the neural network model are updated using the back-propagation algorithm. The loss function may be selected according to the practical situation and is not limited here.
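A minimal PyTorch sketch of the training procedure described above, assuming an LSTM-based predictor; the architecture, dimensions, preset loss threshold, and MSE loss are all illustrative assumptions rather than the disclosure's prescribed choices:

```python
import torch
import torch.nn as nn

class LevelPredictor(nn.Module):
    """Hypothetical network: embeds (participle, part-of-speech) codes and
    regresses a semantic definition level quantized value per participle."""
    def __init__(self, vocab_size: int, pos_size: int, dim: int = 64):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(pos_size, dim)
        self.lstm = nn.LSTM(2 * dim, dim, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, word_ids, pos_ids):
        x = torch.cat([self.word_emb(word_ids), self.pos_emb(pos_ids)], dim=-1)
        h, _ = self.lstm(x)
        return self.head(h).squeeze(-1)  # shape (batch, seq_len)

model = LevelPredictor(vocab_size=10000, pos_size=30)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()   # the disclosure leaves the loss function open
PRESET_LOSS = 0.05       # hypothetical preset loss threshold

def train_step(word_ids, pos_ids, labeled_levels) -> bool:
    """One iteration: stop when loss < preset loss, else back-propagate."""
    pred = model(word_ids, pos_ids)
    loss = loss_fn(pred, labeled_levels.float())
    if loss.item() < PRESET_LOSS:
        return True   # training ends
    optimizer.zero_grad()
    loss.backward()   # back-propagation updates the model parameters
    optimizer.step()
    return False
```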
In the inference stage, the plurality of participles contained in a text word segmentation result (of the predicted text or the standard text), together with their parts of speech, can be assembled into a participle coding sequence, where each participle code in the sequence encodes one participle and its part of speech. Taking a recurrent neural network or a long short-term memory network as an example, the participle coding sequence is input into the network, which analyzes the long-term dependencies between the participle codes in the sequence and uses them to determine the semantic definition level quantized value of the participle contained in each code.
For example, consider the sentence "I say, where shall we go to play tonight?", whose word segmentation result is ['I', 'say', 'this night', 'we', 'go', 'where', 'play', '?'].
When the relationship between participle part of speech and semantic definition level is determined in the rule-based manner, the semantic definition level quantized value of each participle contained in this word segmentation result is as shown in Table three.
Table three: semantic definition level quantized values of the participles, determined in the rule-based manner

Participle     Part-of-speech tag    Quantized value of semantic definition level
I              r                     1
say            v                     2
this night     t                     3
we             r                     1
go             v                     2
where          r                     1
play           v                     2
?              w                     0
When the relationship between participle part of speech and semantic definition level is determined by neural network model recognition, the word segmentation result and the parts of speech of its participles can be used as inputs to the neural network model, which predicts the semantic definition level quantized value of each participle contained in the word segmentation result.
Taking the recurrent neural network as an example, a participle coding sequence {x1, x2, x3, x4, x5, x6, x7, x8} is constructed from the word segmentation result ['I', 'say', 'this night', 'we', 'go', 'where', 'play', '?'], where x1 is the participle code of "I", x2 of "say", x3 of "this night", x4 of "we", x5 of "go", x6 of "where", x7 of "play", and x8 of "?". Each participle code contains the participle and its corresponding part of speech. Table four shows the semantic definition level quantized value of each participle determined by neural network model recognition.
Table four: semantic definition level quantized values of the participles, determined by neural network model recognition

Participle     Part-of-speech tag    Quantized value of semantic definition level
I              r                     1
say            v                     3
this night     t                     3
we             r                     1
go             v                     1
where          r                     1
play           v                     2
?              w                     0
Comparing Table three with Table four shows that, for the same word segmentation result, the semantic definition level quantized values determined in the rule-based manner and by neural network model recognition differ to some extent. This is because, when determining the quantized value of a participle such as "say", the neural network model considers not only its semantics and part of speech but also the long-term dependencies of the whole sentence "I say, where shall we go to play tonight?", so that the quantized value comes closer to the level at which the participle actually defines the semantics of that sentence.
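Reusing the hypothetical LevelPredictor sketched earlier, the participle coding sequence {x1, ..., x8} for this example sentence might be constructed and scored as follows; the vocabulary and tag indices are toy assumptions:

```python
words = ["I", "say", "this night", "we", "go", "where", "play", "?"]
pos   = ["r", "v", "t", "r", "v", "r", "v", "w"]

WORD2ID = {w: i for i, w in enumerate(words)}   # toy vocabulary
POS2ID = {"r": 0, "v": 1, "t": 2, "w": 3}       # toy tag index

word_ids = torch.tensor([[WORD2ID[w] for w in words]])  # participle codes x1..x8
pos_ids = torch.tensor([[POS2ID[p] for p in pos]])

with torch.no_grad():
    levels = model(word_ids, pos_ids).round().clamp(0, 3)
# levels[0, i] is the predicted quantized value for participle code x_{i+1}
```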
When the editing importance of an editing operation is positively correlated with the semantic definition level of the participle to which the erroneous statistical object belongs, the positive correlation may be linear or nonlinear. For example, let P denote the editing importance of the editing operation and Lev the semantic definition level of the participle to which the erroneous statistical object belongs; with Lev as the independent variable and P as the dependent variable, the relation between P and Lev may satisfy a quadratic or higher-order function. Such a quadratic or higher-order relation amplifies the influence of the semantic definition level on the editing importance, so that when the editing importance is used to correct the edit quantized data, the corrected data more readily reflects the influence of a semantic deviation, and a semantic deviation of the predicted text is more easily reflected in the performance evaluation index. The editing importance of the editing operation is described below by type of editing operation.
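For instance, a quadratic relation between Lev and P could look like the following sketch; the normalization by n and the exponent are assumptions used only to illustrate the amplification effect:

```python
def edit_importance(lev: float, n: int = 4, order: int = 2) -> float:
    """A hypothetical nonlinear positive correlation: P = (Lev / n) ** order.
    order=2 gives the quadratic case; larger orders amplify further."""
    return (lev / n) ** order

# Under order=2, a level-3 error weighs 9x a level-1 error (versus 3x linearly),
# so errors on strongly defining participles dominate the corrected edit data.
assert edit_importance(3) / edit_importance(1) == 9.0
```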
When the editing operation is a replacement operation, the editing importance P_Si of the i-th replacement operation satisfies Formula three:

P_Si = (lev(w_Si) + lev(w'_Si)) / (2n)    (Formula three)

where lev(w_Si) denotes the semantic definition level quantized value of the participle to which the pre-replacement object of the i-th replacement operation belongs, lev(w'_Si) denotes that of the post-replacement object, n denotes the total number of semantic definition levels, and i is the serial number of a replacement operation required to transfer the predicted text into the standard text, with 1 ≤ i ≤ S, where S is the total number of replacement operations. The sum of the editing importance of all replacement operations is therefore Σ_{i=1..S} P_Si = Σ_{i=1..S} (lev(w_Si) + lev(w'_Si)) / (2n).
In practical applications, the total number of replacement operations may be determined from the difference between the predicted text and the standard text. For example, when the total number of replacement operations equals 0, none of the statistical objects contained in the predicted text needs to be replaced; when it equals the total number of statistical objects contained in the predicted text, every statistical object in the predicted text needs to be replaced. As can be seen from Formula three, the editing importance P_Si of the i-th replacement operation of the exemplary embodiment of the present disclosure takes into account the semantic definition levels of both the pre-replacement object and the post-replacement object and averages them, so that P_Si is evaluated more comprehensively, which further improves the accuracy of the editing importance of the replacement operation.
When the editing operation is a deletion operation, the editing importance P_Dj of the j-th deletion operation satisfies Formula four:

P_Dj = lev(w_Dj) / n    (Formula four)

where lev(w_Dj) denotes the semantic definition level quantized value of the participle to which the deletion object of the j-th deletion operation belongs, n denotes the total number of semantic definition levels, and j is the serial number of a deletion operation required to transfer the predicted text into the standard text, with 1 ≤ j ≤ D, where D is the total number of deletion operations. The sum of the editing importance of all deletion operations is therefore Σ_{j=1..D} P_Dj = Σ_{j=1..D} lev(w_Dj) / n.
In practical applications, the total number of deletion operations can be determined from the difference between the predicted text and the standard text. For example, when the total number of deletion operations equals 0, none of the statistical objects contained in the predicted text needs to be deleted. Since it is impossible for all statistical objects to be deleted, the total number of deletion operations is smaller than the total number of statistical objects contained in the predicted text.
When the editing operation is an insertion operation, the editing importance P_Ik of the k-th insertion operation satisfies Formula five:

P_Ik = lev(w_Ik) / n    (Formula five)

where lev(w_Ik) denotes the semantic definition level quantized value of the participle to which the insertion object of the k-th insertion operation belongs, n denotes the total number of semantic definition levels, and k is the serial number of an insertion operation required to transfer the predicted text into the standard text, with 1 ≤ k ≤ I, where I is the total number of insertion operations. The sum of the editing importance of all insertion operations is therefore Σ_{k=1..I} P_Ik = Σ_{k=1..I} lev(w_Ik) / n.
When the total number of insertion operations equals 0, no statistical object contained in the standard text needs to be inserted into the predicted text. Since it is impossible for all the statistical objects contained in the standard text to need insertion into the predicted text, the total number of insertion operations is smaller than the total number of statistical objects contained in the standard text.
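Formulas three to five, as reconstructed above, can be written compactly as follows; n = 4 matches the four-level table, and the function names are ours:

```python
def p_sub(lev_before: float, lev_after: float, n: int = 4) -> float:
    """Formula three: average the semantic definition level quantized values
    of the pre- and post-replacement objects, normalized by the level count n."""
    return (lev_before + lev_after) / (2 * n)

def p_del(lev_deleted: float, n: int = 4) -> float:
    """Formula four: editing importance of one deletion operation."""
    return lev_deleted / n

def p_ins(lev_inserted: float, n: int = 4) -> float:
    """Formula five: editing importance of one insertion operation."""
    return lev_inserted / n
```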
When the performance evaluation index measures the error rate of the predicted text of the voice sample relative to the standard text, and the edit quantized data is corrected using the editing importance of the editing operations, the editing importance of the replacement operations can be regarded as a penalty term on the minimum replacement number S, that of the deletion operations as a penalty term on the minimum deletion number D, and that of the insertion operations as a penalty term on the minimum insertion number I. For example, if only the editing importance of the editing operations is considered when correcting the edit quantized data, E_total in Formula two can satisfy Formula six:

E_total = S + Σ_{i=1..S} P_Si + D + Σ_{j=1..D} P_Dj + I + Σ_{k=1..I} P_Ik    (Formula six)

Substituting Formula six into Formula two, the performance evaluation index WER, when only the editing importance of the editing operations is considered in correcting the edit quantized data, satisfies Formula seven:

WER = E_total / (E_total + C)    (Formula seven)

where C denotes the recognition-correct quantized data, that is, the number of correctly recognized statistical objects, and E_total is given by Formula six.
When transferring the predicted text into the standard text: if no replacement operation is performed, then in Formula seven S = 0 and Σ_{i} P_Si = 0; if no deletion operation is performed, then D = 0 and Σ_{j} P_Dj = 0; and if no insertion operation is performed, then I = 0 and Σ_{k} P_Ik = 0, and the corresponding terms drop out of the numerator and denominator.
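Under the same reconstruction, Formula seven can be sketched as below, building on the helpers above; sub_pairs holds the (pre-replacement, post-replacement) level pairs, and c is the recognition-correct count:

```python
def fused_semantic_wer(sub_pairs, del_levs, ins_levs, c: int, n: int = 4) -> float:
    """Formula seven (as reconstructed): penalty terms are added to the
    minimum edit counts S, D, I before forming the WER ratio against C."""
    S, D, I = len(sub_pairs), len(del_levs), len(ins_levs)
    penalty = (sum(p_sub(a, b, n) for a, b in sub_pairs)
               + sum(p_del(l, n) for l in del_levs)
               + sum(p_ins(l, n) for l in ins_levs))
    e_total = S + D + I + penalty
    return e_total / (e_total + c)

# If no replacement is performed, sub_pairs is empty and its terms vanish,
# matching the degenerate cases of Formula seven noted above.
```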
In an alternative, the semantic difference parameter of the exemplary embodiment of the present disclosure introduces the influence of emotion offset into the edit quantized data from the perspective of emotion analysis; that is, the semantic difference parameter may include emotion difference data of the editing operation. The emotion difference data of the editing operation can reflect the negative emotion that a semantic error brings to the user's understanding.
The emotion difference data of an editing operation is the difference between the emotion quantized data of the pre-editing object and the emotion quantized data of the post-editing object of the editing operation. The emotion quantized data can be determined by text emotion analysis and is represented in the form of an emotion percentage.
For a replacement operation, the emotion difference data Pe(Si, Si') of the i-th replacement operation satisfies Formula eight:

Pe(Si, Si') = |Si − Si'|    (Formula eight)

where Si is the emotion percentage of the pre-replacement object of the i-th replacement operation and Si' is that of the post-replacement object, both ranging from 0 to 1. When Si = Si', Pe(Si, Si') = 0; when Si > Si' or Si < Si', Pe(Si, Si') is positive. Based on this, the sum of the emotion difference data of all replacement operations is Σ_{i=1..S} Pe(Si, Si').
The emotion difference data of insertion operations and deletion operations can be ignored.
On this basis, if only the influence of the emotion offset on the performance evaluation index is considered, E_total satisfies

E_total = S + Σ_{i=1..S} Pe(Si, Si') + D + I,

and substituting this into Formula two yields a performance evaluation index WER satisfying Formula nine:

WER = (S + Σ_{i=1..S} Pe(Si, Si') + D + I) / (S + Σ_{i=1..S} Pe(Si, Si') + D + I + C)    (Formula nine)
Considering that the semantics are necessarily offset whenever the emotion is offset, the exemplary embodiment of the present disclosure may determine the editing operations using syntactic analysis, then determine the editing importance of the editing operations of the different error types based on the semantic definition level quantized values, and then introduce the emotion difference data of the replacement operations, so that E_total in Formula two may satisfy Formula ten:

E_total = S + Σ_{i=1..S} P_Si + Σ_{i=1..S} Pe(Si, Si') + D + Σ_{j=1..D} P_Dj + I + Σ_{k=1..I} P_Ik    (Formula ten)

Substituting Formula ten into Formula two yields the performance evaluation index that fuses semantic offset and emotion offset when the edit quantized data is corrected with the editing importance of the editing operations; this WER satisfies Formula eleven:

WER = E_total / (E_total + C), with E_total given by Formula ten    (Formula eleven)
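A sketch of Formulas eight and eleven as reconstructed above, building on the p_sub/p_del/p_ins helpers; sub_emotions holds the (Si, Si') emotion-percentage pairs of the replacement operations:

```python
def emotion_diff(si: float, si_prime: float) -> float:
    """Formula eight: Pe(Si, Si') = |Si - Si'|, emotion percentages in [0, 1]."""
    return abs(si - si_prime)

def fused_semantic_emotion_wer(sub_pairs, sub_emotions, del_levs, ins_levs,
                               c: int, n: int = 4) -> float:
    """Formula eleven (as reconstructed): emotion differences of replacements
    are added on top of the semantic penalty terms of Formula seven."""
    S, D, I = len(sub_pairs), len(del_levs), len(ins_levs)
    e_total = (S + D + I
               + sum(p_sub(a, b, n) for a, b in sub_pairs)
               + sum(p_del(l, n) for l in del_levs)
               + sum(p_ins(l, n) for l in ins_levs)
               + sum(emotion_diff(si, sp) for si, sp in sub_emotions))
    return e_total / (e_total + c)
```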
To demonstrate that the method of the exemplary embodiment of the present disclosure can reflect not only the objective difference between the predicted text and the standard text but also their subjective difference from the perspective of semantic understanding, the following examples are given.
Example one
Standard text ref = "I say, where shall we go to play tonight?", whose word segmentation result is ['I', 'say', 'this night', 'we', 'go', 'where', 'play', '?']. In the first predicted text hyp1, "this night" is mis-transcribed as "experience", giving the word segmentation result ['I', 'say', 'experience', 'we', 'go', 'where', 'play', '?']. In the second predicted text hyp2, "this night" is mis-transcribed as "semen", giving the word segmentation result ['I', 'say', 'semen', 'we', 'go', 'where', 'play', '?']. Table five shows the comparison of the different performance evaluation indexes for example one.
Table five: comparison of different performance evaluation indexes for example one

Example one    Related-art WER    Fused semantic offset WER    Fused semantic and emotion offset WER
(ref, hyp1)    0.125              0.222                        0.232
(ref, hyp2)    0.125              0.222                        0.247
(ref, hyp1) denotes the case of editing the first predicted text hyp1 into the standard text ref, referred to below as the first case for convenience; (ref, hyp2) denotes the case of editing the second predicted text hyp2 into the standard text ref, referred to below as the second case.
Example two
Standard text ref = "Jiannan, I still have one more problem, which is that I am annoyed.", whose word segmentation result is ['Jiannan', ',', 'I', 'still', 'one', 'problem', 'just I', 'annoyed', '.']. In the first predicted text hyp1, the name "Jiannan" is transcribed as the harmless homophone "build south", giving the word segmentation result ['build south', ',', 'I', 'still', 'one', 'problem', 'just I', 'annoyed', '.']. In the second predicted text hyp2, the name is transcribed as the offensive homophone "basic men", giving the word segmentation result ['basic men', ',', 'I', 'still', 'one', 'problem', 'just I', 'annoyed', '.']. Table six shows the comparison of the different performance evaluation indexes for example two.
Table six: comparison of different performance evaluation indexes for example two

Example two    Related-art WER    Fused semantic offset WER    Fused semantic and emotion offset WER
(ref, hyp1)    0.111              0.2                          0.209
(ref, hyp2)    0.111              0.2                          0.230
(ref, hyp1) denotes the case of editing the first predicted text hyp1 into the standard text ref, referred to below as the first case for convenience; (ref, hyp2) denotes the case of editing the second predicted text hyp2 into the standard text ref, referred to below as the second case.
In examples one and two, when the user speaks the standard text ref, the speech recognition model may transcribe it into either of two possible error cases, the first predicted text hyp1 or the second predicted text hyp2. Of these, in an office scenario the first predicted text hyp1 can be regarded as an ordinary error that does not affect the semantics or the user's emotions too much, whereas the second predicted text hyp2 causes considerable emotional disturbance to office staff in public settings such as a conference room and is regarded as an emotional error.
The related-art WER cannot sufficiently measure the performance of the speech recognition model. For example, in example one the related-art WER determined by Formula one is 0.125, and in example two it is 0.111; in neither case can it differentiate the distinct semantic and emotional impacts that the two error cases have on the user.
When the influence of transcription errors on the performance evaluation index is considered in terms of semantic offset, the performance evaluation index fuses the semantic offset and can reflect the degree of semantic offset in a differentiated manner. When the influence is considered in terms of both semantic offset and emotion, the performance evaluation index fuses the semantic offset and the emotion offset and can reflect the degrees of both in a differentiated manner.
As shown in Table five, in example one, the fused semantic offset WER of the first case determined by Formula seven is 0.222, and its fused semantic and emotion offset WER determined by Formula eleven is 0.232; the fused semantic offset WER of the second case determined by Formula seven is 0.222, and its fused semantic and emotion offset WER determined by Formula eleven is 0.247.
As shown in Table six, in example two, the fused semantic offset WER of the first case determined by Formula seven is 0.2, and its fused semantic and emotion offset WER determined by Formula eleven is 0.209; the fused semantic offset WER of the second case determined by Formula seven is 0.2, and its fused semantic and emotion offset WER determined by Formula eleven is 0.230.
As can be seen from Tables five and six, when the fused WER accounts for semantic offset but not emotion offset, the first and second cases of the same example have the same fused semantic offset WER. Once the emotion offset is also fused, however, the two cases differ: the WER of the first case, an ordinary error, is smaller than that of the second case, an emotional error, which is consistent with actual judgment.
Therefore, the fused semantic offset WER can reflect the performance of an algorithm fusing semantic computation, and the fused semantic and emotion offset WER can reflect the performance of an algorithm fusing semantic-syntactic analysis and emotion computation. When the semantic offset is mild, the fused semantic offset WER is close to the related-art WER; when the semantic offset is severe, it is larger than the related-art WER. Likewise, when the semantic and emotion offsets are mild, the fused semantic and emotion offset WER is close to the fused semantic offset WER; when they are severe, it is larger. The performance evaluation index used in the method of the exemplary embodiment of the present disclosure can therefore reflect the algorithm performance of the speech recognition model in a differentiated manner.
Example three
Standard text ref = "The relevant functions are already in the regression test phase, and then some function points of the server end remain.", whose word segmentation result is ['relevant', 'functions', 'already', 'in', 'regression', 'test', 'phase', 'in', 'then', 'also', 'left', 'server', 'of', 'some', 'function', 'points', '.']. First predicted text hyp1 = "The relevant functions are already in the regression test phase, and some function points of the server end remain.", which lacks the conjunction "then"; its word segmentation result is the same as that of ref except that 'then' is missing. Second predicted text hyp2 = "The relevant functions are already in the test phase, and then some function points of the server end remain.", which lacks the noun "regression"; its word segmentation result is the same as that of ref except that 'regression' is missing. Table seven shows the comparison of the different performance evaluation indexes for example three.
Table seven: comparison of different performance evaluation indexes for example three

Example three    Related-art WER    Fused semantic offset WER    Fused semantic and emotion offset WER
(ref, hyp1)      0.058              0.0625                       0.0625
(ref, hyp2)      0.058              0.0735                       0.0735
(ref, hyp1) denotes the case of editing the first predicted text hyp1 into the standard text ref, referred to below as the first case for convenience; (ref, hyp2) denotes the case of editing the second predicted text hyp2 into the standard text ref, referred to below as the second case.
In example three, when the user speaks the standard text ref, the speech recognition model may transcribe it into either of two possible error cases, the first predicted text hyp1 or the second predicted text hyp2. The first case lacks the conjunction "then", which has a negligible influence on the understanding of the whole sentence; the second case lacks the noun "regression", which significantly affects the understanding of the sentence. Moreover, when the user reads the recognition results of the two cases, neither error affects the user's emotions.
The related-art WER cannot sufficiently measure the performance of the speech recognition model: in example three, the related-art WER determined by Formula one is 0.058 for both cases, so the different influences that errors of different content have on the user at the semantic level cannot be measured in a differentiated manner. As shown in Table seven, the fused semantic offset WER of the first case determined by Formula seven is 0.0625, and its fused semantic and emotion offset WER determined by Formula eleven is also 0.0625; the fused semantic offset WER of the second case determined by Formula seven is 0.0735, and its fused semantic and emotion offset WER determined by Formula eleven is also 0.0735. Thus the fused semantic offset WER differentiates the two deletion errors (0.0625 versus 0.0735), and, since neither error causes an emotion offset, the fused semantic and emotion offset WER of each case equals its fused semantic offset WER.
In an alternative manner, the semantic difference parameter of the exemplary embodiment of the present disclosure may correct the edit quantized data by being introduced in a weighted manner: the semantic difference parameter may weight the edit quantized data and determine the first parameter from the corrected edit quantized data, or it may weight both the edit quantized data and the recognition-correct quantized data and determine the second parameter from the corrected edit quantized data and the corrected recognition-correct quantized data.
The semantic difference parameter of the exemplary embodiment of the present disclosure may include a statistical weighting parameter: if the predicted text contains a target participle, a statistical weighting parameter greater than 1 is applied to the recognition quantization parameter of the statistical object belonging to that target participle.
In practical applications, a keyword table may be established to define a mapping relation between target keywords and statistical weighting parameters, yielding a second mapping relation. After the predicted text and the standard text are segmented, whether the participle to which a statistical object belongs is a target participle can be queried in the second mapping relation; if so, the statistical weighting parameter corresponding to that target participle is read from the second mapping relation. On this basis, if a statistical object belonging to such a participle is recognized correctly, then when counting the number of correctly recognized words in the predicted text, the count contributed by that statistical object (which would otherwise be 1) is weighted by the statistical weighting parameter.
For example, predicted text hyp = "are you learning manually only", with word segmentation result ['you', 'in', 'learn', 'manually only', 'do'], and standard text ref = "are you learning artificial intelligence?", with word segmentation result ['you', 'in', 'learn', 'artificial intelligence', 'do']; the participle "artificial intelligence" is mis-transcribed as "manually only".
With the participle as the minimum statistical unit, 4 participles are recognized correctly, the number of replacement operations is 1, and the numbers of deletion and insertion operations are both 0; that is, S = 1, D = I = 0, C = 4. Without the statistical weighting parameter, the performance evaluation index calculated with Formula one is WER = (1 + 0 + 0) / (1 + 0 + 0 + 4) = 1/5 = 0.2.
When the predicted text hyp is transcribed into the standard text ref, if the post-replacement object "artificial intelligence" of the replacement operation exists in the keyword table with a corresponding statistical weighting parameter of 20, the performance evaluation index calculated with Formula one becomes WER = (1 × 20 + 0 + 0) / (1 × 20 + 0 + 0 + 4) = 20/24 ≈ 0.83. If instead a correctly recognized participle exists in the keyword table with a statistical weighting parameter of 20, the performance evaluation index becomes WER = (1 + 0 + 0) / (1 + 0 + 0 + 3 + 20) = 1/24 ≈ 0.042. The comparison shows that a keyword table can be designed in which keywords with substantial meaning are assigned statistical weighting parameters, forming the second mapping relation; if the keyword table contains a participle of the standard sample or the predicted sample and that participle carries substantial meaning, this situation is reflected in the performance evaluation index through the statistical weighting parameter.
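The worked example can be reproduced with the following sketch; the keyword table contents come from the text, while the helper function and its simplification to a single weighted substitution or correct keyword are ours:

```python
KEYWORD_TABLE = {"artificial intelligence": 20}  # second mapping relation

def weighted_wer(s: int, d: int, i: int, c: int,
                 sub_weight: float = 1.0, correct_bonus: float = 0.0) -> float:
    """Formula one with a statistical weighting parameter applied to one
    substitution (sub_weight) or one correctly recognized keyword (correct_bonus)."""
    num = s * sub_weight + d + i
    return num / (num + c + correct_bonus)

print(weighted_wer(1, 0, 0, 4))                    # 0.2    (no weighting)
print(weighted_wer(1, 0, 0, 4, sub_weight=20))     # ~0.83  (weighted substitution)
print(weighted_wer(1, 0, 0, 3, correct_bonus=20))  # ~0.042 (weighted correct keyword)
```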
It should be noted that when the semantic difference parameters of the exemplary embodiment of the present disclosure include the statistical weighting parameter, the editing importance of the editing operations, and the emotion difference data of the editing operations, the counts S, D, and I used when calculating the editing importance and the emotion difference data are not weighted by the statistical weighting parameter.
As can be seen from the above, in one or more embodiments provided by the exemplary embodiment of the present disclosure, when the speech recognition model is in the training stage and a statistical object is recognized incorrectly, the recognition quantization parameter includes the edit quantized data of the editing operation, and the semantic difference parameter is used to correct that edit quantized data. Therefore, when the performance evaluation index is determined based on the edit quantized data and the recognition-correct quantized data, the edit quantized data can be corrected with the semantic difference parameter, so that the corrected edit quantized data reflects not only the objective difference between the predicted text and the standard text but also their subjective difference from the perspective of semantic understanding. On this basis, after the performance evaluation index is determined from the corrected edit quantized data and the recognition-correct quantized data, and when it is used to evaluate the error rate of the predicted text of the voice sample relative to the standard text, the performance evaluation index can distinguish the influence of semantic deviation on the recognition result. The performance evaluation index used in the method of the exemplary embodiment of the present disclosure can thus differentially measure the influence of the predicted text on the user's understanding, preventing recognized text with a large semantic deviation from being mistakenly treated as correct text and output.
The above description mainly introduces the solutions provided by the embodiments of the present disclosure from the perspective of electronic devices. It is understood that the electronic device comprises corresponding hardware structures and/or software modules for performing the respective functions in order to realize the functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is performed as hardware or computer software drives hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The embodiment of the present disclosure may perform division of function units on the electronic device according to the method example, for example, each function module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, the division of the modules in the embodiments of the present disclosure is illustrative, and is only one division of logic functions, and there may be another division in actual implementation.
In the case of dividing each functional module by corresponding each function, the exemplary embodiments of the present disclosure provide a voice recognition method, which may be an electronic device or a chip applied to the electronic device. Fig. 5 shows a functional block schematic diagram of a speech recognition apparatus according to an exemplary embodiment of the present disclosure. As shown in fig. 5, the speech recognition apparatus 500 includes:
an obtaining module 501, configured to obtain a target voice;
a recognition module 502, configured to determine a recognition text of a target speech based on a speech recognition model, where the speech recognition model is trained in the following manner:
predicting a predicted text of a voice sample by using the voice recognition model, and updating model parameters of the voice recognition model in response to the performance evaluation index of the voice recognition model meeting an iteration condition, wherein the performance evaluation index is used for measuring the error rate of the predicted text of the voice sample relative to a standard text, parameters of the performance evaluation index comprise a recognition quantization parameter and a semantic difference parameter of a statistical object contained in the predicted text, when the statistical object is correctly recognized, the recognition quantization parameter comprises recognition-correct quantized data, when the statistical object is incorrectly recognized, the recognition quantization parameter comprises edit quantized data of an editing operation, and the semantic difference parameter is used for correcting the edit quantized data.
In a possible implementation manner, the performance evaluation index is determined by a first parameter and a second parameter, the performance evaluation index is positively correlated with the first parameter, the performance evaluation index is negatively correlated with the second parameter, and the semantic difference parameter is positively correlated with both the first parameter and the second parameter.
In a possible implementation manner, the semantic difference parameter includes an editing importance of an editing operation, and the editing importance of the editing operation is positively correlated with the semantic definition level of the participle to which the editing object of the editing operation belongs.
In one possible implementation, the semantic definition level is determined by the part of speech of the word to which the statistical object with the recognition error belongs.
In a possible implementation manner, the editing operation is a replacement operation, and the editing importance P_Si of the i-th replacement operation satisfies:

P_Si = (lev(w_Si) + lev(w'_Si)) / (2n)

where lev(w_Si) denotes the semantic definition level quantized value of the participle to which the pre-replacement object of the i-th replacement operation belongs, lev(w'_Si) denotes that of the post-replacement object, and n denotes the total number of semantic definition levels.
In one possible implementation manner, the editing operation is a deletion operation, and the editing importance P_Dj of the j-th deletion operation satisfies:

P_Dj = lev(w_Dj) / n

where lev(w_Dj) denotes the semantic definition level quantized value of the participle to which the deletion object of the j-th deletion operation belongs, and n denotes the total number of semantic definition levels.
In one possible implementation, the editing operation is an insertion operation, and the editing importance P_Ik of the k-th insertion operation satisfies:

P_Ik = lev(w_Ik) / n

where lev(w_Ik) denotes the semantic definition level quantized value of the participle to which the insertion object of the k-th insertion operation belongs, and n denotes the total number of semantic definition levels.
In one possible implementation, the semantic difference parameter includes emotion difference data of an editing operation, and the emotion difference data of the editing operation includes a difference value between emotion quantization data of a pre-editing object and emotion quantization data of a post-editing object.
In a possible implementation manner, the semantic difference parameter includes a statistical weighting parameter of an editing operation, and if the predicted text contains a target word segmentation, the statistical weighting parameter is greater than 1 for an identification quantization parameter of the statistical object belonging to the target word segmentation.
In a possible implementation manner, the statistical object is a statistical object with the participle as the minimum unit, and the semantic difference parameters of editing objects belonging to different participles are independent of each other; or,
the statistical object is a statistical object with the character as the minimum unit, and editing objects belonging to the same participle share the semantic difference parameter.
In the case of dividing each functional module by corresponding each function, the exemplary embodiments of the present disclosure provide a training method, which may be a server or a chip applied to the server. FIG. 6 shows a schematic block diagram of functional modules of a training apparatus according to an exemplary embodiment of the present disclosure. As shown in fig. 6, the training apparatus 600 includes:
a prediction module 601, configured to predict a predicted text of a speech sample by using a speech recognition model;
an updating module 602, configured to update the model parameters of the speech recognition model in response to the performance evaluation index of the speech recognition model meeting an iteration condition, where the performance evaluation index is used to measure the error rate of the predicted text of the speech sample relative to a standard text, and the parameters of the performance evaluation index include a recognition quantization parameter and a semantic difference parameter of a statistical object contained in the predicted text; when the statistical object is correctly recognized, the recognition quantization parameter includes recognition-correct quantized data, when the statistical object is incorrectly recognized, the recognition quantization parameter includes edit quantized data of an editing operation, and the semantic difference parameter is used to correct the edit quantized data.
Fig. 7 shows a schematic block diagram of a chip according to an exemplary embodiment of the present disclosure. As shown in fig. 7, the chip 700 includes one or more than two (including two) processors 701 and a communication interface 702. The communication interface 702 may perform the obtaining steps of the above-described method and the processor 701 may perform the processing steps of the above-described method.
Optionally, as shown in fig. 7, the chip 700 further includes a memory 703, which may include read-only memory and random access memory and provides the processor with operating instructions and data. A portion of the memory may also include non-volatile random access memory (NVRAM).
In some embodiments, as shown in fig. 7, the processor 701 performs the corresponding operation by calling an operating instruction stored in the memory (the operating instruction may be stored in the operating system). The processor 701 controls the processing operations of any of the terminal devices and may also be referred to as a Central Processing Unit (CPU). The memory 703 may include both read-only memory and random access memory and provides instructions and data to the processor 701. A portion of the memory 703 may also include NVRAM. In applications, the processor, communication interface, and memory are coupled together by a bus system, which may include a power bus, a control bus, a status signal bus, and the like in addition to the data bus. For clarity of illustration, however, the various buses are labeled in fig. 7 as the bus system 704.
The method disclosed by the embodiment of the disclosure can be applied to a processor or implemented by the processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software. The processor may be a general purpose processor, a Digital Signal Processor (DSP), an ASIC, an FPGA (field-programmable gate array) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present disclosure may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present disclosure may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps of the method in combination with hardware of the processor.
An exemplary embodiment of the present disclosure also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor, the computer program, when executed by the at least one processor, is for causing the electronic device to perform a method according to an embodiment of the disclosure.
The disclosed exemplary embodiments also provide a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
The exemplary embodiments of the present disclosure also provide a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
Referring to fig. 8, a block diagram of an electronic device that may be a server or a client of the present disclosure, which is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
As shown in fig. 8, a number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, an output unit 807, a storage unit 808, and a communication unit 809. The input unit 806 may be any type of device capable of inputting information to the electronic device 800; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device. The output unit 807 can be any type of device capable of presenting information and can include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 808 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 809 allows the electronic device 800 to exchange information with other devices via a computer network such as the internet and/or various telecommunication networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth devices, Wireless Fidelity (WiFi) devices, Worldwide Interoperability for Microwave Access (WiMAX) devices, cellular communication devices, and/or the like.
As shown in FIG. 8, computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing Unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable Processor, controller, microcontroller, and the like. The calculation unit 801 executes the respective methods and processes described above. For example, in some embodiments, the methods of the exemplary embodiments of the present disclosure may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. In some embodiments, the computing unit 801 may be configured in any other suitable manner (e.g., by means of firmware) to perform the methods of the exemplary embodiments of the present disclosure.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium may include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), an optical fiber, a Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a cathode ray tube or a liquid crystal display monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of the present disclosure are performed in whole or in part. The computer may be a general purpose computer, special purpose computer, computer network, terminal, user equipment, or other programmable device. The computer program or instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer program or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire or wirelessly. The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that integrates one or more available media. The usable medium may be a magnetic medium, such as a floppy disk, a hard disk, a magnetic tape; or optical media such as Digital Video Disks (DVDs); it may also be a semiconductor medium, such as a Solid State Drive (SSD).
While the disclosure has been described in conjunction with specific features and embodiments thereof, it will be evident that various modifications and combinations can be made thereto without departing from the spirit and scope of the disclosure. Accordingly, the specification and figures are merely exemplary of the present disclosure as defined in the appended claims and are intended to cover any and all modifications, variations, combinations, or equivalents within the scope of the present disclosure. It will be apparent to those skilled in the art that various changes and modifications may be made to the present disclosure without departing from the spirit and scope of the disclosure. Thus, if such modifications and variations of the present disclosure fall within the scope of the claims of the present disclosure and their equivalents, the present disclosure is intended to include such modifications and variations as well.

Claims (12)

1. A speech recognition method, comprising:
obtaining target voice, and determining a recognition text of the target voice based on a voice recognition model, wherein the voice recognition model is obtained by training in the following way:
predicting a predicted text of a voice sample by using the voice recognition model, and updating model parameters of the voice recognition model in response to the performance evaluation index of the voice recognition model meeting an iteration condition, wherein the performance evaluation index is used for measuring the error rate of the predicted text of the voice sample relative to a standard text, parameters of the performance evaluation index comprise a recognition quantization parameter and a semantic difference parameter of a statistical object contained in the predicted text, when the statistical object is correctly recognized, the recognition quantization parameter comprises recognition correct quantization data, and when the statistical object is incorrectly recognized, the recognition quantization parameter comprises editing quantization data of an editing operation, and the semantic difference parameter is used for correcting the editing quantization data.
2. The method of claim 1, wherein the performance evaluation index is determined by a first parameter and a second parameter, the performance evaluation index is positively correlated with the first parameter, the performance evaluation index is negatively correlated with the second parameter, and the semantic difference parameter is positively correlated with both the first parameter and the second parameter.
3. The method of claim 1, wherein the semantic difference parameter comprises an editing importance of an editing operation, the editing importance being positively correlated with the semantic restriction level of the word segment to which the editing object of the editing operation belongs.
4. The method of claim 3, wherein the semantic restriction level is determined by the part of speech of the word segment to which the incorrectly recognized statistical object belongs.
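(Editor's illustration.) Claims 3 and 4 tie an edit's importance to the semantic restriction level of the enclosing word segment, looked up from its part of speech. A minimal sketch; the table values are assumptions, since the patent discloses the dependency but no concrete numbers:

```python
# Hypothetical part-of-speech -> semantic restriction level table.
POS_SEMANTIC_LEVEL = {
    "noun": 1.0,       # content words: strong semantic restriction
    "verb": 0.9,
    "adjective": 0.7,
    "adverb": 0.5,
    "particle": 0.2,   # function words: weak semantic restriction
}

def editing_importance(pos_of_segment, default_level=0.5):
    """Importance of an edit grows with the semantic restriction level of
    the word segment containing the edited object (claims 3-4)."""
    return POS_SEMANTIC_LEVEL.get(pos_of_segment, default_level)
```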
5. The method of any one of claims 1 to 4, wherein the semantic difference parameter comprises emotion difference data of an editing operation, the emotion difference data comprising the difference between the emotion quantization data of the object before editing and the emotion quantization data of the object after editing.
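(Editor's illustration.) A sketch of the emotion difference data of claim 5, assuming a hypothetical `emotion_score` sentiment scorer that returns a float:

```python
def emotion_difference(pre_edit_object, post_edit_object, emotion_score):
    """Claim 5: the emotion difference data of an editing operation is the
    difference between the emotion quantization of the object before the
    edit and the object after it."""
    return abs(emotion_score(post_edit_object) - emotion_score(pre_edit_object))
```

Under this reading, an edit that flips a strongly positive word to a strongly negative one is penalized more heavily than one that swaps two near-neutral words.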
6. The method of any one of claims 1 to 4, wherein the semantic difference parameter comprises a statistical weighting parameter of an editing operation, and wherein, if the predicted text contains a target word segment, the statistical weighting parameter applied to the recognition quantization parameter of a statistical object belonging to the target word segment is greater than 1.
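(Editor's illustration.) Claim 6's statistical weighting parameter can be sketched as a simple boost for statistical objects inside a target word segment; the factor 2.0 is an assumed example, not a value from the patent:

```python
def statistical_weight(token, target_segments, boost=2.0):
    """Claim 6: objects belonging to a target word segment (e.g. a domain
    keyword) receive a weighting parameter greater than 1."""
    return boost if token in target_segments else 1.0
```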
7. The method of any one of claims 1 to 4, wherein the statistical object takes a word segment as its minimum unit, and the semantic difference parameters of editing objects belonging to different word segments are independent of each other; or
the statistical object takes a character as its minimum unit, and editing objects belonging to the same word segment share the semantic difference parameter.
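(Editor's illustration.) The character-granularity branch of claim 7 has every character inherit the semantic difference parameter of its word segment, so characters in one segment share one weight. A sketch with all names assumed:

```python
def character_level_weights(characters, segment_index, segment_weights):
    """Claim 7, character as minimum unit: characters in the same word
    segment share the segment's semantic difference parameter."""
    return [segment_weights[segment_index[i]] for i in range(len(characters))]

# e.g. a four-character text split into two word segments:
# character_level_weights(list("abcd"), [0, 0, 1, 1], [0.9, 1.0])
# -> [0.9, 0.9, 1.0, 1.0]
```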
8. A training method, comprising:
predicting a predicted text of a speech sample by using a speech recognition model;
updating model parameters of the speech recognition model in response to a performance evaluation index of the speech recognition model meeting an iteration condition;
wherein the performance evaluation index measures the error rate of the predicted text of the speech sample relative to a standard text, and its parameters comprise a recognition quantization parameter and a semantic difference parameter of a statistical object contained in the predicted text; when the statistical object is recognized correctly, the recognition quantization parameter comprises correct-recognition quantization data; when the statistical object is recognized incorrectly, the recognition quantization parameter comprises edit quantization data of an editing operation; and the semantic difference parameter is used to correct the edit quantization data.
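(Editor's illustration.) Claim 8 fixes only the control flow: keep updating model parameters while the performance evaluation index still meets the iteration condition. A sketch reusing the `weighted_error_rate` helper from the note under claim 1; `model.predict` and `model.update` are hypothetical stand-ins, and since the index itself is non-differentiable, a real trainer would back-propagate a surrogate loss (cf. the G06N3/084 classification):

```python
def train_until_converged(model, samples, references, semantic_weight,
                          threshold=0.05, max_steps=1000):
    """Update model parameters while the performance evaluation index
    meets the iteration condition (claim 8 control flow only)."""
    for _ in range(max_steps):
        predictions = [model.predict(s) for s in samples]
        index = sum(
            weighted_error_rate(p, r, semantic_weight)
            for p, r in zip(predictions, references)
        ) / len(samples)
        if index <= threshold:             # iteration condition no longer met
            break
        model.update(samples, references)  # one parameter update step
```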
9. A speech recognition apparatus, comprising:
an acquisition module configured to obtain target speech;
a recognition module configured to determine a recognition text of the target speech based on a speech recognition model, wherein the speech recognition model is trained in the following way:
predicting a predicted text of a speech sample by using the speech recognition model, and updating model parameters of the speech recognition model in response to a performance evaluation index of the speech recognition model meeting an iteration condition; wherein the performance evaluation index measures the error rate of the predicted text of the speech sample relative to a standard text, and its parameters comprise a recognition quantization parameter and a semantic difference parameter of a statistical object contained in the predicted text; when the statistical object is recognized correctly, the recognition quantization parameter comprises correct-recognition quantization data; when the statistical object is recognized incorrectly, the recognition quantization parameter comprises edit quantization data of an editing operation; and the semantic difference parameter is used to correct the edit quantization data.
10. A training apparatus, comprising:
a prediction module configured to predict a predicted text of a speech sample by using a speech recognition model;
an updating module configured to update model parameters of the speech recognition model in response to a performance evaluation index of the speech recognition model meeting an iteration condition; wherein the performance evaluation index measures the error rate of the predicted text of the speech sample relative to a standard text, and its parameters comprise a recognition quantization parameter and a semantic difference parameter of a statistical object contained in the predicted text; when the statistical object is recognized correctly, the recognition quantization parameter comprises correct-recognition quantization data; when the statistical object is recognized incorrectly, the recognition quantization parameter comprises edit quantization data of an editing operation; and the semantic difference parameter is used to correct the edit quantization data.
11. An electronic device, comprising:
a processor; and
a memory storing a program;
wherein the program comprises instructions which, when executed by the processor, cause the processor to carry out the method according to any one of claims 1 to 8.
12. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method according to any one of claims 1 to 8.
CN202210995296.XA 2022-08-18 2022-08-18 Speech recognition method, training method, device, electronic equipment and storage medium Pending CN115359799A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210995296.XA CN115359799A (en) 2022-08-18 2022-08-18 Speech recognition method, training method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210995296.XA CN115359799A (en) 2022-08-18 2022-08-18 Speech recognition method, training method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115359799A (en) 2022-11-18

Family

ID=84002924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210995296.XA Pending CN115359799A (en) 2022-08-18 2022-08-18 Speech recognition method, training method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115359799A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115938353A (en) * 2022-11-24 2023-04-07 北京数美时代科技有限公司 Voice sample distributed sampling method, system, storage medium and electronic equipment
CN115579000A (en) * 2022-12-07 2023-01-06 中诚华隆计算机技术有限公司 Intelligent correction method and system for voice recognition chip
CN115579000B (en) * 2022-12-07 2023-03-03 中诚华隆计算机技术有限公司 Intelligent correction method and system for voice recognition chip
CN115687334A (en) * 2023-01-05 2023-02-03 粤港澳大湾区数字经济研究院(福田) Data quality inspection method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
WO2019153737A1 (en) Comment assessing method, device, equipment and storage medium
CN106897439B (en) Text emotion recognition method, device, server and storage medium
WO2018028077A1 (en) Deep learning based method and device for chinese semantics analysis
CN107679234A (en) Customer service information providing method, device, electronic equipment, storage medium
CN111309915A (en) Method, system, device and storage medium for training natural language of joint learning
CN115359799A (en) Speech recognition method, training method, device, electronic equipment and storage medium
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN111177186B (en) Single sentence intention recognition method, device and system based on question retrieval
CN112687328B (en) Method, apparatus and medium for determining phenotypic information of clinical descriptive information
CN113672732B (en) Method and device for classifying service data
CN115357719A (en) Power audit text classification method and device based on improved BERT model
CN110874536A (en) Corpus quality evaluation model generation method and bilingual sentence pair inter-translation quality evaluation method
CN116719520B (en) Code generation method and device
CN115859980A (en) Semi-supervised named entity identification method, system and electronic equipment
CN116910633A (en) Power grid fault prediction method based on multi-modal knowledge mixed reasoning
CN114997169A (en) Entity word recognition method and device, electronic equipment and readable storage medium
CN113705207A (en) Grammar error recognition method and device
CN112685374B (en) Log classification method and device and electronic equipment
CN117114657A (en) Fault information early warning system and method based on power equipment inspection knowledge graph
CN116305257A (en) Privacy information monitoring device and privacy information monitoring method
CN113011162B (en) Reference digestion method, device, electronic equipment and medium
CN117151089A (en) New word discovery method, device, equipment and medium
CN110442714B (en) POI name normative evaluation method, device, equipment and storage medium
CN114676699A (en) Entity emotion analysis method and device, computer equipment and storage medium
CN114117082B (en) Method, apparatus, and medium for correcting data to be corrected

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination