CN116229954A - Data processing method and device, readable medium and electronic equipment

Publication number
CN116229954A
Authority
CN
China
Prior art keywords
text
decoded
target text
attention distribution
repeated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310085274.4A
Other languages
Chinese (zh)
Inventor
叶圣泽
陈智鹏
马泽君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Abstract

The disclosure relates to a data processing method, a data processing device, a readable medium and electronic equipment, which can solve the repeated decoding problem of a speech recognition model without changing the effect of the model itself, thereby improving the accuracy of speech recognition. The method comprises the following steps: obtaining a decoded text output by the speech recognition model after it processes the audio to be recognized; determining, according to the decoded text, whether repeated decoding occurred while the speech recognition model was obtaining the decoded text; and, if it is determined that repeated decoding occurred, performing de-duplication processing on the decoded text and determining the text obtained after the de-duplication processing as the text corresponding to the audio to be recognized.

Description

Data processing method and device, readable medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method, a data processing device, a readable medium, and an electronic device.
Background
In the field of speech recognition, separate modules such as an acoustic model, a language model, a dictionary and a decoder are generally fused into an end-to-end speech recognition system that is trained on audio-text pairs and can achieve good results in speech recognition tasks. In an end-to-end speech recognition model, the attention mechanism between the encoder and the decoder, together with the autoregressive nature of decoding, allows every decoding step to make full use of all the acoustic information from the encoder, which yields a better recognition effect. However, when speech recognition relies on this autoregressive characteristic and a global attention mechanism, the model sometimes keeps attending to one or several frames of data during decoding, so that a segment of the decoding result appears repeatedly. This is the repeated decoding problem.
Disclosure of Invention
This section is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This section is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a data processing method, the method comprising:
obtaining a decoded text output by a speech recognition model after it processes audio to be recognized;
determining, according to the decoded text, whether repeated decoding occurred while the speech recognition model was obtaining the decoded text;
and, if it is determined that repeated decoding occurred, performing de-duplication processing on the decoded text and determining the text obtained after the de-duplication processing as the text corresponding to the audio to be recognized.
In a second aspect, the present disclosure provides a data processing apparatus, the apparatus comprising:
an acquisition module, configured to obtain a decoded text output by a speech recognition model after it processes audio to be recognized;
a determining module, configured to determine, according to the decoded text, whether repeated decoding occurred while the speech recognition model was obtaining the decoded text;
and a de-duplication module, configured to perform de-duplication processing on the decoded text if it is determined that repeated decoding occurred, and to determine the text obtained after the de-duplication processing as the text corresponding to the audio to be recognized.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which when executed by a processing device performs the steps of the method of the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having at least one computer program stored thereon;
at least one processing means for executing the at least one computer program in the storage means to carry out the steps of the method of the first aspect of the present disclosure.
According to the above technical solution, the decoded text output by the speech recognition model after processing the audio to be recognized is first obtained; it is then determined whether repeated decoding occurred while the model was obtaining the decoded text; and if repeated decoding occurred, the decoded text is de-duplicated and the resulting text is used as the text corresponding to the audio to be recognized. In this way, without changing the decoding output of the speech recognition model, the decoded text is checked for the repeated decoding problem, and when the problem is detected the repeated parts of the decoded text are corrected promptly, which can further improve the accuracy of speech recognition.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow chart of a data processing method provided in accordance with one embodiment of the present disclosure;
FIG. 2 is an exemplary flowchart of the steps for determining whether a speech recognition model has repeated decoding in the process of obtaining decoded text in a data processing method provided in accordance with the present disclosure;
FIG. 3 is a block diagram of a data processing apparatus provided in accordance with one embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of an electronic device suitable for implementing embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that the modifiers "a" and "a plurality of" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
All actions in this disclosure to obtain signals, information or data are performed in compliance with the applicable data protection laws and policies of the country concerned and with authorization granted by the owner of the corresponding device.
It will be appreciated that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in an appropriate manner and in accordance with the relevant laws and regulations, of the type, scope of use, usage scenarios, etc. of the personal information involved in the present disclosure, and the user's authorization should be obtained.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly indicate that the requested operation will require obtaining and using the user's personal information. The user can thus autonomously choose, according to the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, an application program, a server or a storage medium, that executes the operations of the technical solution of the present disclosure.
As an alternative but non-limiting implementation, the prompt information may be sent to the user in response to an active request, for example, as a pop-up window in which the prompt information is presented as text. The pop-up window may also carry a selection control with which the user chooses to "agree" or "disagree" to provide personal information to the electronic device.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.
Meanwhile, it can be understood that the data (including but not limited to the data itself, the acquisition or the use of the data) related to the technical scheme should conform to the requirements of the corresponding laws and regulations and related regulations.
As described in the background, in current end-to-end speech models, for example the LAS (Listen, Attend and Spell) model, the attention mechanism between the encoder and the decoder and the autoregressive nature of decoding may cause the repeated decoding problem. The related art generally tries to reduce the probability of repeated decoding from the perspective of model training, for example by removing training data that contains much repeated text, by making the attention distribution more even during training, or by replacing the global attention mechanism with a causal attention mechanism so that attention does not keep focusing on one segment. However, these approaches leave the autoregressive decoding of the model unchanged, so repeated decoding may still occur. Moreover, changing the loss function or the attention mechanism alters the structure of the model itself, which may degrade the model's performance.
To solve the above technical problems, the disclosure provides a data processing method, a data processing device, a readable medium and an electronic device, which can solve the repeated decoding problem of a model without changing the effect of the speech recognition model, thereby improving the accuracy of speech recognition.
Fig. 1 is a flow chart of a data processing method provided in accordance with one embodiment of the present disclosure. As shown in fig. 1, the method provided by the present disclosure may include steps 11 to 13.
In step 11, a decoded text output by the speech recognition model after it processes the audio to be recognized is obtained.
In step 12, it is determined, according to the decoded text, whether repeated decoding occurred while the speech recognition model was obtaining the decoded text.
In step 13, if it is determined that repeated decoding occurred, de-duplication processing is performed on the decoded text, and the text obtained after the de-duplication processing is determined as the text corresponding to the audio to be recognized.
That is, the data processing method provided by the present disclosure is in effect a post-processing step applied to the output of the speech recognition model (i.e., the decoded text). The speech recognition model itself does not need to be modified, so its processing effect is unchanged. Instead, whether repeated decoding occurred during processing is determined from the decoded text output by the model, and if so, the decoded text is de-duplicated to correct it, thereby ensuring the recognition accuracy for the audio to be recognized.
In order to make the data processing method provided by the present disclosure more understandable to those skilled in the art, the present disclosure will be described in more detail below in conjunction with the above steps.
In step 11, a decoded text output by the speech recognition model after it processes the audio to be recognized is obtained.
The speech recognition model processes audio to obtain the text corresponding to the audio. It may include an encoder and a decoder (for example, an RNN, i.e., a recurrent neural network, may be adopted), and it processes the audio through the attention mechanism between the encoder and the decoder and the autoregressive characteristic of the decoder. The speech recognition model may be, for example, the LAS model mentioned above.
Thus, step 11 in effect obtains the output result of the speech recognition model, that is, the decoded text that the speech recognition model outputs through the decoder.
In step 12, it is determined, according to the decoded text, whether repeated decoding occurred while the speech recognition model was obtaining the decoded text.
In one possible embodiment, step 12 may comprise the steps of:
in step 21, determining whether repeated target text units exist in the decoded text;
in step 22, if repeated target text units exist, obtaining the attention distribution vector corresponding to each target text unit;
in step 23, determining, according to the obtained attention distribution vectors, whether a loop of the attention distribution exists for the target text units;
in step 24, if it is determined that a loop of the attention distribution exists for the target text units, determining that repeated decoding occurred while the speech recognition model was obtaining the decoded text.
As described above, the speech recognition model is based on the attention mechanism. In general, repeated decoding arises when the model keeps attending to one or several frames during decoding, which causes a segment to appear repeatedly in the decoding result. Repeated decoding therefore usually shows two manifestations: the repetition of a text segment in the decoding result, and sustained attention on a subset of frames under the attention mechanism. The former can be identified directly from the text of the decoding result, and the latter is reflected in the attention distribution vectors.
Thus, step 21 may be performed first to determine whether there are duplicate target text units in the decoded text.
In one possible embodiment, step 21 may comprise the steps of:
acquiring a repetition judgment rule;
determining, for each number of text characters, whether the decoded text contains a text segment whose number of characters equals that number of text characters and whose number of consecutive occurrences reaches the consecutive repetition count threshold corresponding to that number of text characters;
if such a text segment exists, determining that repeated target text units exist, and determining the text segment as the target text unit.
The repetition judgment rule indicates the numbers of text characters and the consecutive repetition count threshold corresponding to each number of text characters. In general, the larger the number of text characters, the smaller the corresponding threshold, and the smaller the number of text characters, the larger the corresponding threshold. For example, for a number of text characters of 1, the threshold may be set to 15, that is, the same character must appear 15 times in succession before it is determined to be repeated; for a number of text characters of 2, the threshold may be set to 10, that is, a two-character text segment is determined to be repeated only if it appears 10 times in succession; for a number of text characters of 3, the threshold may be set to 8, that is, a three-character text segment is determined to be repeated only if it appears 8 times in succession.
After the repetition determination rule is obtained, for each text character number, it may be determined whether there is a text segment in the decoded text whose character number corresponds to the text character number and whose number of consecutive occurrences reaches a threshold number of consecutive repetitions corresponding to the text character number. And if the text segment exists, determining that repeated target text units exist, and determining the text segment as the target text units.
That is, the above determination is made separately for each number of text characters, to establish whether a repeated text segment with that number of characters exists. Optionally, for each number of text characters, a greedy search may be made in the decoded text for a character string that has that number of characters and appears multiple times in succession, and it is then judged whether the repetition count of the character string reaches the consecutive repetition count threshold corresponding to that number of text characters, so as to determine whether the character string is a text segment meeting the above conditions.
For example, suppose the decoded text is abcdecdecdecdecdecdecdecdecdecdecdefz (that is, "ab", followed by "cde" 11 times in succession, followed by "fz"), and the given repetition judgment rule sets a consecutive repetition count threshold of 15 for the number of text characters 1, 10 for the number of text characters 2, and 8 for the number of text characters 3. Then:
for the number of text characters 1, no consecutively repeated single character is found;
for the number of text characters 2, no consecutively repeated two-character segment is found;
for the number of text characters 3, the character string "cde" is found to appear 11 times in succession, which exceeds the consecutive repetition count threshold of 8 corresponding to the number of text characters 3;
based on this, it can be determined that there is a three-character text segment whose number of consecutive occurrences reaches the consecutive repetition count threshold of 8, that is, a repeated target text unit exists, and the character string "cde" is the target text unit.
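The search described above can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of the disclosure: the function name `find_repeated_unit`, the dictionary form of the rule, and the choice to return the first qualifying run are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple


def find_repeated_unit(text: str, rules: Dict[int, int]) -> Optional[Tuple[str, int, int]]:
    """Return (unit, start_index, repeat_count) for the first qualifying run, or None.

    `rules` maps a number of text characters to its consecutive repetition
    count threshold, e.g. {1: 15, 2: 10, 3: 8} (hypothetical representation).
    """
    for unit_len, threshold in sorted(rules.items()):
        for start in range(len(text) - unit_len + 1):
            unit = text[start:start + unit_len]
            count = 1
            pos = start + unit_len
            # Greedily extend the run while the next slice equals the unit.
            while text[pos:pos + unit_len] == unit:
                count += 1
                pos += unit_len
            if count >= threshold:
                return unit, start, count
    return None


decoded = "ab" + "cde" * 11 + "fz"
rules = {1: 15, 2: 10, 3: 8}
print(find_repeated_unit(decoded, rules))  # ('cde', 2, 11)
```

The scan is quadratic in the worst case, which is likely acceptable for utterance-length text; a production system might prefer a different search strategy.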
Optionally, when it is determined that a repeated target text unit exists, the position of the target text unit in the decoded text may also be located, for use in the subsequent judgment of the attention distribution vectors and in the de-duplication processing. For example, the position of the start character of the first occurrence of the target text unit in the decoded text and the position of the end character of its last occurrence may be obtained. Taking the decoded text in the above example, and assuming that the subscript of the first character a is 0, the subscript of the second character b is 1, and so on, the start character of the first occurrence of the target text unit "cde" is at subscript 2, and the end character of its last occurrence is at subscript 34.
In addition, if no text segment can be found whose number of characters equals a number of text characters in the rule and whose number of consecutive occurrences reaches the corresponding consecutive repetition count threshold, no repeated target text unit exists in the decoded text, that is, repeated decoding did not occur.
If it is determined that a repeated target text unit exists, one of the two manifestations of repeated decoding described above is present. It is then further necessary to determine whether the attention mechanism keeps focusing on a certain portion of the data, and this determination can be made with the attention distribution vectors.
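As a rough illustration of this positioning step, the sketch below returns the subscript of the start character of the first occurrence and the subscript of the end character of the last occurrence within the longest consecutive run. The function name `locate_repeated_span` and the "longest run" tie-breaking are assumptions for the example, not details taken from the disclosure.

```python
from typing import Optional, Tuple


def locate_repeated_span(text: str, unit: str) -> Optional[Tuple[int, int]]:
    """Return (first_start, last_end) of the longest consecutive run of `unit`.

    Both subscripts are 0-based and inclusive. Returns None when `unit`
    never occurs at least twice in succession.
    """
    if not unit:
        return None
    best: Optional[Tuple[int, int]] = None
    best_count = 1
    start = text.find(unit)
    while start != -1:
        pos = start
        count = 0
        # Count back-to-back occurrences beginning at `start`.
        while text.startswith(unit, pos):
            count += 1
            pos += len(unit)
        if count > best_count:
            best, best_count = (start, pos - 1), count
        start = text.find(unit, start + 1)
    return best


decoded = "ab" + "cde" * 11 + "fz"
print(locate_repeated_span(decoded, "cde"))  # (2, 34)
```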
Thus, in step 22, if there are duplicate target text units, a corresponding attention distribution vector is obtained for each target text unit.
In an attention-based model, the decoder obtains each decoded token from the data output by the encoder and an attention distribution vector, so the attention distribution vector corresponding to a token is readily available as intermediate data whenever the decoder determines that token. Each character in each target text unit therefore has its own attention distribution vector (i.e., the attention distribution vector obtained after the character is input into the decoder), and the attention distribution vector corresponding to each target text unit can be obtained directly. For example, if the target text unit contains 1 text character, the attention distribution vector corresponding to the target text unit is simply that of its single character. If the target text unit contains 2 or more text characters, the attention distribution vector corresponding to the target text unit includes a start attention distribution vector corresponding to its start character and an end attention distribution vector corresponding to its end character. For the target text unit "cde" in the above example, the attention distribution vectors may include the one corresponding to "c" and the one corresponding to "e", and they are determined in this manner for each occurrence of "cde" in the decoded text.
After the attention distribution vector corresponding to each target text unit is obtained, step 23 may be executed to determine, according to the obtained attention distribution vectors, whether a loop of the attention distribution exists for the target text units.
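One way to picture the selection of start and end attention distribution vectors is sketched below. It assumes, purely for illustration, that the decoder exposes one attention distribution vector per decoded character in a list aligned with the decoded text; the name `unit_attention_vectors` and this data layout are hypothetical. For a single-character unit the start and end vectors coincide, which is a convenience of the sketch.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def unit_attention_vectors(
    attention: Sequence[Vector], start: int, unit_len: int, repeat_count: int
) -> List[Tuple[Vector, Vector]]:
    """Return one (start_vector, end_vector) pair per occurrence of the unit.

    `attention[i]` is assumed to be the attention distribution vector of the
    character at subscript i of the decoded text (an assumption of this sketch).
    """
    pairs = []
    for k in range(repeat_count):
        first = start + k * unit_len   # subscript of the start character
        last = first + unit_len - 1    # subscript of the end character
        pairs.append((attention[first], attention[last]))
    return pairs


# Toy data: the "vector" of character i is just [i], to make the indexing visible.
toy_attention = [[float(i)] for i in range(10)]
print(unit_attention_vectors(toy_attention, start=2, unit_len=3, repeat_count=2))
# [([2.0], [4.0]), ([5.0], [7.0])]
```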
In one possible embodiment, step 23 may comprise the steps of:
for each target text unit, determining the similarity between the attention distribution vector corresponding to that target text unit and the attention distribution vector corresponding to each of the other target text units;
and if every similarity is greater than a preset similarity threshold, determining that a loop of the attention distribution exists for the target text units.
That is, for each target text unit, the similarity between its attention distribution vector and those of the other target text units is determined. If the attention distribution vectors are all similar, a loop of the attention distribution has occurred in addition to the repeated text segments, and it can therefore be determined that repeated decoding occurred. For example, if there are 5 target text units, then for each target text unit the similarity between its attention distribution vector and those of the other 4 target text units needs to be determined.
For example, if the attention distribution vector corresponding to a target text unit includes a start attention distribution vector corresponding to its start character and an end attention distribution vector corresponding to its end character, the similarity may include a first similarity between the start attention distribution vectors of different target text units and a second similarity between the end attention distribution vectors of different target text units. For example, if there are 5 target text units, namely target text units 1 to 5, the similarity between target text unit 1 and target text unit 5 may include a first similarity between the start attention distribution vector of target text unit 1 and that of target text unit 5, and a second similarity between the end attention distribution vector of target text unit 1 and that of target text unit 5.
Alternatively, the similarity between the attention distribution vectors may be obtained by calculating cosine similarity between the attention distribution vectors.
In this way, if every calculated similarity (for a target text unit of two or more characters, this covers both the first similarity and the second similarity) is greater than the preset similarity threshold, it may be determined that a loop of the attention distribution exists for the target text units.
In step 24, if it is determined that a loop of the attention distribution exists for the target text units, it is determined that repeated decoding occurred while the speech recognition model was obtaining the decoded text.
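The similarity check can be sketched as below, using the cosine similarity that the text mentions as one option. The threshold value 0.9 and the names `cosine_similarity` and `has_attention_loop` are assumptions for the example; the disclosure does not fix a threshold.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def has_attention_loop(pairs: Sequence[Tuple[Vector, Vector]], threshold: float = 0.9) -> bool:
    """True if every pair of occurrences is similar in both start and end vectors.

    `pairs` holds one (start_vector, end_vector) tuple per target text unit.
    The default threshold is an illustrative value only.
    """
    for (s1, e1), (s2, e2) in combinations(pairs, 2):
        first_similarity = cosine_similarity(s1, s2)
        second_similarity = cosine_similarity(e1, e2)
        if first_similarity <= threshold or second_similarity <= threshold:
            return False
    return len(pairs) >= 2


looping = [([1.0, 0.0], [0.0, 1.0])] * 3
moving = [([1.0, 0.0], [0.0, 1.0]), ([0.0, 1.0], [0.0, 1.0])]
print(has_attention_loop(looping), has_attention_loop(moving))  # True False
```

Comparing every pair of occurrences covers the "each target text unit against all the others" requirement without computing any similarity twice.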
Optionally, step 12 may further include the steps of:
if no repeated target text unit exists, or if it is determined that no loop of the attention distribution exists for the target text units, determining that repeated decoding did not occur while the speech recognition model was obtaining the decoded text.
That is, if either of the two manifestations of repeated decoding mentioned above is absent, it can be determined that repeated decoding did not occur.
Returning to Fig. 1, in step 13, if it is determined that repeated decoding occurred, the decoded text is de-duplicated, and the text obtained after the de-duplication processing is determined as the text corresponding to the audio to be recognized.
In one possible implementation, the de-duplication processing of the decoded text may include the following steps:
in the decoded text, the repeated target text units are retained as a single target text unit to obtain the de-duplicated text.
That is, for decoded text in which repeated decoding occurred, once the target text units have been located, the decoded text is de-duplicated by retaining only one of the repeated target text units.
Taking the above-given decoding text abcdecdecdecdecdecdecdecdecdecdecdecdecdecdecdecdecdecdecdecdecdecdeffz as an example, there is a target text unit "cde", and thus "cde" appearing 11 times is replaced by a single "cde", abcdefz is obtained, and the deduplication process is completed.
After the decoding text is subjected to the de-duplication processing, the text obtained after the de-duplication processing can be used as the text corresponding to the audio to be identified.
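As a rough illustration of this de-duplication step, the sketch below collapses a run of consecutive copies of a target text unit into a single copy. The function name `collapse_repeats` and the use of a regular expression are assumptions made for the example; the disclosure does not prescribe a particular implementation.

```python
import re


def collapse_repeats(decoded_text: str, unit: str) -> str:
    """Replace each run of two or more consecutive `unit`s with a single `unit`."""
    if not unit:
        return decoded_text
    # re.escape guards against units that contain regex metacharacters.
    pattern = "(?:" + re.escape(unit) + "){2,}"
    # A function replacement avoids backslash interpretation in `unit`.
    return re.sub(pattern, lambda _match: unit, decoded_text)


# Hypothetical decoded text: "cde" repeated 11 times between "ab" and "fz".
looped = "ab" + "cde" * 11 + "fz"
print(collapse_repeats(looped, "cde"))  # prints "abcdefz"
```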
Alternatively, if it is determined by step 12 that repeated decoding does not occur, the decoded text obtained in step 11 may be directly determined as the text corresponding to the audio to be recognized.
Fig. 3 is a block diagram of a data processing apparatus provided in accordance with one embodiment of the present disclosure. As shown in fig. 3, the apparatus 30 may include:
the obtaining module 31 is configured to obtain a decoded text that is output after the voice recognition model processes the audio to be recognized;
a determining module 32, configured to determine, according to the decoded text, whether repeated decoding occurs in the process of obtaining the decoded text by the speech recognition model;
and the de-duplication module 33 is configured to perform de-duplication processing on the decoded text if it is determined that repeated decoding occurs, and determine the text obtained after the de-duplication processing as the text corresponding to the audio to be identified.
Optionally, the speech recognition model is a model based on an attention mechanism;
the determining module 32 includes:
a first determining submodule, configured to determine whether a repeated target text unit exists in the decoded text;
the first acquisition sub-module is used for acquiring the attention distribution vector corresponding to each target text unit if the repeated target text unit exists;
a second determining submodule, configured to determine, according to the acquired attention distribution vectors, whether the target text unit has a cycle of attention distribution;
and a third determining submodule, configured to determine, if it is determined that the target text unit has a cycle of attention distribution, that repeated decoding occurred while the speech recognition model was obtaining the decoded text.
Optionally, the determining module 32 further includes:
and a fourth determining submodule, configured to determine that repeated decoding did not occur while the speech recognition model was obtaining the decoded text, if no repeated target text unit exists or if it is determined that the target text unit does not have a cycle of attention distribution.
Optionally, the first determining sub-module includes:
a second acquisition submodule, configured to acquire a repetition judgment rule, wherein the repetition judgment rule indicates one or more text character counts and a consecutive repetition count threshold corresponding to each text character count;
a fifth determining submodule, configured to determine, for each text character count, whether the decoded text contains a text segment whose number of characters matches the text character count and whose number of consecutive occurrences reaches the consecutive repetition count threshold corresponding to that text character count;
and a sixth determining submodule, configured to determine that a repeated target text unit exists if the text segment exists, and determine the text segment as the target text unit.
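One way to picture the repetition judgment rule is as a mapping from a text character count to a consecutive repetition count threshold. The sketch below searches the decoded text under that assumption; the function name, the dictionary representation, and the example thresholds are hypothetical and serve only to illustrate the idea.

```python
import re
from typing import Dict, Optional


def find_repeated_unit(decoded_text: str, rule: Dict[int, int]) -> Optional[str]:
    """Return the first text segment that satisfies the rule, or None.

    `rule` maps a segment length (in characters) to the minimum number of
    consecutive occurrences required for the segment to count as repeated.
    """
    for length, min_repeats in sorted(rule.items()):
        if length < 1 or min_repeats < 2:
            continue  # ignore entries that cannot describe a repetition
        # A segment of `length` characters followed by at least
        # (min_repeats - 1) further copies of itself.
        pattern = re.compile(
            r"(.{%d})\1{%d,}" % (length, min_repeats - 1), re.DOTALL
        )
        match = pattern.search(decoded_text)
        if match:
            return match.group(1)
    return None


# Hypothetical rule: 1-character units need 5 repeats, 3-character units need 4.
rule = {1: 5, 3: 4}
print(find_repeated_unit("ab" + "cde" * 11 + "fz", rule))  # prints "cde"
```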
Optionally, the second determining submodule includes:
a seventh determining sub-module, configured to determine, for each of the target text units, a similarity between the attention distribution vectors corresponding to the other target text units except the target text unit and the attention distribution vectors corresponding to the target text unit;
and an eighth determining submodule, configured to determine that a cycle of attention distribution exists in the target text unit if each similarity is greater than a preset similarity threshold.
Optionally, if the number of text characters included in the target text unit is greater than or equal to 2, the attention distribution vector corresponding to the target text unit includes a start attention distribution vector corresponding to a start character of the target text unit and an end attention distribution vector corresponding to an end character of the target text unit;
the similarity includes a first similarity between the starting attention distribution vectors of the different target text units and a second similarity between the ending attention distribution vectors of the different target text units.
Optionally, in the case of repeated decoding, there are repeated target text units in the decoded text;
the de-duplication module 33 includes:
and a processing submodule, configured to retain, in the decoded text, the repeated target text units as a single target text unit so as to obtain the de-duplicated text.
The specific manner in which the various modules of the apparatus in the above embodiments perform their operations has been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
Based on the same conception, the present disclosure also provides a non-transitory computer readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the above-described data processing method.
Based on the same concept, the present disclosure also provides an electronic device, comprising: a storage device having a computer program stored thereon; processing means for executing said computer program in said storage means to implement the steps of the data processing method described above.
Referring now to fig. 4, a schematic diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 4 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 4, the electronic device 600 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; output devices 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, and the like; storage devices 608 including, for example, magnetic tape, hard disk, and the like; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 4 shows an electronic device 600 having various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 601.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, clients and servers may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: obtaining a decoding text which is output after the voice recognition model processes the audio to be recognized; determining whether repeated decoding occurs in the process of obtaining the decoded text by the voice recognition model according to the decoded text; and if the repeated decoding is determined, performing de-duplication processing on the decoded text, and determining the text obtained after the de-duplication processing as the text corresponding to the audio to be identified.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or hardware. In some cases, the name of a module does not constitute a limitation on the module itself; for example, the acquisition module may also be described as "a module for acquiring decoded text output after the speech recognition model processes the audio to be recognized".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, there is provided a data processing method, the method comprising:
obtaining a decoding text which is output after the voice recognition model processes the audio to be recognized;
determining whether repeated decoding occurs in the process of obtaining the decoded text by the voice recognition model according to the decoded text;
and if the repeated decoding is determined, performing de-duplication processing on the decoded text, and determining the text obtained after the de-duplication processing as the text corresponding to the audio to be identified.
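The three steps above can be sketched as a small pipeline. The sketch below is illustrative only: the function name `finalize_transcript` and its three callable parameters are hypothetical, and the detection step is shown using the two checks described for the attention-based embodiment (a repeated target text unit and a cycle of attention distribution).

```python
from typing import Callable, Optional


def finalize_transcript(
    decoded_text: str,
    find_unit: Callable[[str], Optional[str]],  # returns a repeated unit or None
    has_cycle: Callable[[str], bool],           # attention-cycle check for a unit
    collapse: Callable[[str, str], str],        # de-duplicates text for a unit
) -> str:
    """Return the text to use as the recognition result for the audio."""
    unit = find_unit(decoded_text)
    # Repeated decoding is assumed only when both signs are present:
    # a repeated target text unit and a cycle of attention distribution.
    if unit is None or not has_cycle(unit):
        return decoded_text  # no repeated decoding: keep the text as decoded
    return collapse(decoded_text, unit)
```

Passing the checks in as callables keeps the sketch independent of any particular speech recognition model or attention representation.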
According to one or more embodiments of the present disclosure, there is provided a data processing method, the speech recognition model being an attention mechanism-based model;
the determining, according to the decoded text, whether repeated decoding occurred while the speech recognition model was obtaining the decoded text comprises:
determining whether a repeated target text unit exists in the decoded text;
if a repeated target text unit exists, acquiring the attention distribution vector corresponding to each target text unit;
determining, according to the acquired attention distribution vectors, whether the target text unit has a cycle of attention distribution;
and if it is determined that the target text unit has a cycle of attention distribution, determining that repeated decoding occurred while the speech recognition model was obtaining the decoded text.
According to one or more embodiments of the present disclosure, there is provided a data processing method, the method further comprising:
and if no repeated target text unit exists, or if it is determined that the target text unit does not have a cycle of attention distribution, determining that repeated decoding did not occur while the speech recognition model was obtaining the decoded text.
According to one or more embodiments of the present disclosure, there is provided a data processing method, the determining whether there is a repeated target text unit in the decoded text, including:
acquiring a repetition judgment rule, wherein the repetition judgment rule indicates one or more text character counts and a consecutive repetition count threshold corresponding to each text character count;
determining, for each text character count, whether the decoded text contains a text segment whose number of characters matches the text character count and whose number of consecutive occurrences reaches the consecutive repetition count threshold corresponding to that text character count;
if such a text segment exists, determining that a repeated target text unit exists, and determining the text segment as the target text unit.
According to one or more embodiments of the present disclosure, there is provided a data processing method, the determining whether there is a loop of attention distribution of the target text unit according to the acquired attention distribution vector, including:
for each target text unit, determining the similarity between the attention distribution vectors corresponding to other target text units except the target text unit and the attention distribution vector corresponding to the target text unit;
and if each similarity is greater than a preset similarity threshold, determining that the target text unit has a cycle of attention distribution.
According to one or more embodiments of the present disclosure, a data processing method is provided, where if the number of text characters included in the target text unit is greater than or equal to 2, the attention distribution vector corresponding to the target text unit includes a start attention distribution vector corresponding to a start character of the target text unit and an end attention distribution vector corresponding to an end character of the target text unit;
the similarity includes a first similarity between the starting attention distribution vectors of the different target text units and a second similarity between the ending attention distribution vectors of the different target text units.
According to one or more embodiments of the present disclosure, there is provided a data processing method in which, in the event of repeated decoding, there are repeated target text units in the decoded text;
the de-duplication processing for the decoded text comprises the following steps:
in the decoded text, the repeated target text units are retained as a single target text unit to obtain the de-duplicated text.
According to one or more embodiments of the present disclosure, there is provided a data processing apparatus, the apparatus comprising:
the acquisition module is used for acquiring a decoding text which is output after the voice recognition model processes the audio to be recognized;
the determining module is used for determining whether repeated decoding occurs to the voice recognition model in the process of obtaining the decoded text according to the decoded text;
and the de-duplication module is used for carrying out de-duplication processing on the decoded text if the repeated decoding is determined, and determining the text obtained after the de-duplication processing as the text corresponding to the audio to be identified.
According to one or more embodiments of the present disclosure, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the data processing method provided by any embodiment of the present disclosure.
According to one or more embodiments of the present disclosure, there is provided an electronic device including:
a storage device having at least one computer program stored thereon;
at least one processing means for executing the at least one computer program in the storage means to implement the steps of the data processing method provided by any embodiment of the present disclosure.
The foregoing description is only of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to herein is not limited to the specific combinations of features described above, but also covers other embodiments formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure, for example, embodiments formed by mutually substituting the features described above with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (10)

1. A method of data processing, the method comprising:
obtaining a decoding text which is output after the voice recognition model processes the audio to be recognized;
determining whether repeated decoding occurs in the process of obtaining the decoded text by the voice recognition model according to the decoded text;
and if the repeated decoding is determined, performing de-duplication processing on the decoded text, and determining the text obtained after the de-duplication processing as the text corresponding to the audio to be identified.
2. The method of claim 1, wherein the speech recognition model is a model based on an attention mechanism;
the determining, according to the decoded text, whether repeated decoding occurred while the speech recognition model was obtaining the decoded text comprises:
determining whether a repeated target text unit exists in the decoded text;
if a repeated target text unit exists, acquiring the attention distribution vector corresponding to each target text unit;
determining, according to the acquired attention distribution vectors, whether the target text unit has a cycle of attention distribution;
and if it is determined that the target text unit has a cycle of attention distribution, determining that repeated decoding occurred while the speech recognition model was obtaining the decoded text.
3. The method according to claim 2, wherein the method further comprises:
and if no repeated target text unit exists, or if it is determined that the target text unit does not have a cycle of attention distribution, determining that repeated decoding did not occur while the speech recognition model was obtaining the decoded text.
4. The method of claim 2, wherein the determining whether there are duplicate target text units in the decoded text comprises:
acquiring a repetition judgment rule, wherein the repetition judgment rule indicates one or more text character counts and a consecutive repetition count threshold corresponding to each text character count;
determining, for each text character count, whether the decoded text contains a text segment whose number of characters matches the text character count and whose number of consecutive occurrences reaches the consecutive repetition count threshold corresponding to that text character count;
if such a text segment exists, determining that a repeated target text unit exists, and determining the text segment as the target text unit.
5. The method of claim 2, wherein the determining whether the target text unit has a cycle of attention distribution based on the obtained attention distribution vector comprises:
for each target text unit, determining the similarity between the attention distribution vectors corresponding to other target text units except the target text unit and the attention distribution vector corresponding to the target text unit;
and if each similarity is greater than a preset similarity threshold, determining that the target text unit has a cycle of attention distribution.
6. The method of claim 5, wherein if the number of text characters contained in the target text unit is greater than or equal to 2, the attention distribution vector corresponding to the target text unit includes a start attention distribution vector corresponding to a start character of the target text unit and an end attention distribution vector corresponding to an end character of the target text unit;
the similarity includes a first similarity between the starting attention distribution vectors of the different target text units and a second similarity between the ending attention distribution vectors of the different target text units.
7. The method of claim 1, wherein in the event of repeated decoding, there are repeated target text units in the decoded text;
the de-duplication processing for the decoded text comprises the following steps:
in the decoded text, the repeated target text units are retained as a single target text unit to obtain the de-duplicated text.
8. A data processing apparatus, the apparatus comprising:
the acquisition module is used for acquiring a decoding text which is output after the voice recognition model processes the audio to be recognized;
the determining module is used for determining whether repeated decoding occurs to the voice recognition model in the process of obtaining the decoded text according to the decoded text;
and the de-duplication module is used for carrying out de-duplication processing on the decoded text if the repeated decoding is determined, and determining the text obtained after the de-duplication processing as the text corresponding to the audio to be identified.
9. A computer readable medium on which a computer program is stored, characterized in that the program, when being executed by a processing device, carries out the steps of the method according to any one of claims 1-7.
10. An electronic device, comprising:
a storage device having at least one computer program stored thereon;
at least one processing means for executing said at least one computer program in said storage means to carry out the steps of the method according to any one of claims 1-7.
CN202310085274.4A 2023-01-28 2023-01-28 Data processing method and device, readable medium and electronic equipment Pending CN116229954A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310085274.4A CN116229954A (en) 2023-01-28 2023-01-28 Data processing method and device, readable medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310085274.4A CN116229954A (en) 2023-01-28 2023-01-28 Data processing method and device, readable medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116229954A true CN116229954A (en) 2023-06-06

Family

ID=86583782

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310085274.4A Pending CN116229954A (en) 2023-01-28 2023-01-28 Data processing method and device, readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116229954A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination