CN116976311A - Processing method and device of voice recognition result, computer equipment and storage medium - Google Patents


Info

Publication number
CN116976311A
CN116976311A
Authority
CN
China
Prior art keywords
task
processing
text content
original text
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310943642.4A
Other languages
Chinese (zh)
Inventor
申云飞
陈炳州
邹明
张阳
马泽君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202310943642.4A priority Critical patent/CN116976311A/en
Publication of CN116976311A publication Critical patent/CN116976311A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • G10L2015/0631Creating reference templates; Clustering
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a method, an apparatus, a computer device, and a storage medium for processing a speech recognition result. The method comprises the following steps: acquiring a speech recognition result obtained by performing speech recognition on audio to be recognized, where the speech recognition result includes the original text content corresponding to the audio to be recognized; acquiring task guide information corresponding to the original text content, where the task guide information includes at least a processing task and the task requirement of the processing task; and extracting, from the original text content, feature information that satisfies the processing task, and processing the feature information according to the processing operation corresponding to the task requirement to obtain the target text content. In the disclosure, the task guide information is constructed from the processing task and its requirement information, and the processing task on the speech recognition result is executed according to the task guide information, without relying on a plurality of sub-models to complete the corresponding processing tasks; this improves the processing efficiency of the speech recognition result and reduces the workload of the preparatory work.

Description

Processing method and device of voice recognition result, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of speech processing, and in particular, to a method, an apparatus, a computer device, and a storage medium for processing a speech recognition result.
Background
Speech recognition technology has been widely applied, but speech recognition results require post-processing to improve the quality of the final output. Conventional post-processing tasks typically include punctuation recovery, inverse text conversion, text smoothing, and the like. At present, completing these post-processing tasks usually relies on a plurality of independent sub-models, for example: one model for the punctuation recovery task, another model for the inverse text conversion task, and so on. In application, the speech recognition result is input into the corresponding sub-model for processing.
It can therefore be seen that, in the post-processing of the speech recognition result, completing the corresponding processing tasks (e.g., punctuation recovery, inverse text conversion, text smoothing) through a plurality of sub-models increases the complexity of the overall system. Moreover, different sub-models must be trained in advance for different languages or scenarios, so the preparatory work consumes considerable resources and the later maintenance of each sub-model is difficult.
Disclosure of Invention
In view of this, the embodiments of the present disclosure provide a method, an apparatus, a computer device, and a storage medium for processing a speech recognition result, so as to solve the problems that arise when the corresponding processing tasks are completed through a plurality of sub-models: the complexity of the overall system increases, the preparatory work consumes considerable resources, and the later maintenance is difficult.
In a first aspect, an embodiment of the present disclosure provides a method for processing a speech recognition result, where the method includes:
obtaining a voice recognition result obtained by voice recognition of audio to be recognized, wherein the voice recognition result comprises original text content corresponding to the audio to be recognized;
acquiring task guide information corresponding to the original text content, wherein the task guide information at least comprises: processing a task and a task requirement of the processing task;
and extracting characteristic information meeting the processing task from the original text content, and processing the characteristic information according to processing operation corresponding to the task requirement to obtain target text content.
In a second aspect, an embodiment of the present disclosure provides a device for processing a speech recognition result, where the device includes:
the device comprises an acquisition module, a processing module, and an execution module, wherein the acquisition module is used for acquiring a speech recognition result obtained by performing speech recognition on audio to be recognized, and the speech recognition result includes the original text content corresponding to the audio to be recognized;
the processing module is used for acquiring task guide information corresponding to the original text content, wherein the task guide information at least comprises: task execution sequence, at least one processing task and task requirements of the processing task;
And the execution module is used for extracting the characteristic information meeting the processing task from the original text content according to the task execution sequence, and processing the characteristic information according to the processing operation corresponding to the task requirement to obtain the target text content.
In a third aspect, embodiments of the present disclosure provide a computer device, comprising a memory and a processor in communication connection with each other, wherein the memory stores computer instructions, and the processor executes the computer instructions to perform the method of the first aspect or any of its corresponding implementations.
In a fourth aspect, the disclosed embodiments provide a computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of the first aspect or any of its corresponding embodiments.
The method provided by the embodiment of the disclosure has the following advantages: the task guide information is constructed from the processing task and its requirement information, and the processing task on the speech recognition result is executed according to the task guide information, without relying on a plurality of sub-models to complete the corresponding processing tasks; this improves the processing efficiency of the speech recognition result. Even if a processing task changes, only the task guide information needs to be updated and no sub-model needs to be retrained, which reduces the workload of the preparatory work.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the prior art, the drawings required in the detailed description or the description of the prior art are briefly introduced below. It is apparent that the drawings in the following description show some embodiments of the present disclosure, and that a person of ordinary skill in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a method of processing speech recognition results according to some embodiments of the present disclosure;
FIG. 2 is a flow chart of a method of processing speech recognition results according to some embodiments of the present disclosure;
FIG. 3 is a flow diagram of processing of speech recognition results according to some embodiments of the present disclosure;
FIG. 4 is a block diagram of a processing device of speech recognition results according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions, and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. Based on the embodiments in this disclosure, all other embodiments obtained by a person skilled in the art without inventive effort fall within the scope of protection of this disclosure.
It will be appreciated that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in an appropriate manner and in accordance with the relevant laws and regulations, of the type, scope of use, and usage scenarios of the personal information involved, and the user's authorization should be obtained.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly indicate that the operation the user requests to perform will require the user's personal information to be obtained and used. The user can thus autonomously decide, according to the prompt information, whether to provide personal information to the software or hardware (such as an electronic device, application, server, or storage medium) that executes the operations of the technical solution of the present disclosure.
As an optional but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent to the user by way of, for example, a pop-up window, in which the prompt information may be presented as text. The pop-up window may also carry a selection control allowing the user to choose "agree" or "disagree" to providing personal information to the electronic device.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.
It will be appreciated that the data (including but not limited to the data itself, the acquisition or use of the data) involved in the present technical solution should comply with the corresponding legal regulations and the requirements of the relevant regulations.
According to the embodiments of the present disclosure, a method, an apparatus, a computer device, and a storage medium for processing a speech recognition result are provided. It should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, such as by a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one described herein.
In this embodiment, a method for processing a speech recognition result is provided, which may be applied to a mobile terminal such as a mobile phone or a tablet computer. Fig. 1 is a flowchart of a method for processing a speech recognition result according to an embodiment of the disclosure. As shown in fig. 1, the flow includes the following steps:
step S11, a voice recognition result obtained by voice recognition of the audio to be recognized is obtained, wherein the voice recognition result comprises original text content corresponding to the audio to be recognized.
The method provided by the embodiment of the disclosure is applied to a smart device capable of speech processing, where the speech processing may include voice recording, speech recognition, speech synthesis, fluency detection, and the like. The smart device may be a personal computer, a mobile device (a mobile phone, a tablet computer, etc.), a smart wearable device (a smart watch, a smart bracelet, etc.), and so on.
In the embodiment of the disclosure, the smart device may acquire the audio to be recognized input by a user. The audio to be recognized may be recorded speech, a segment of audio from a film or television work, or audio transmitted from another device, where the other device may be another smart terminal or a storage medium such as a portable hard disk.
In an embodiment of the present disclosure, the process of recognizing the audio to be recognized includes: first, extracting the audio frames of the audio to be recognized, detecting each audio frame to obtain its audio feature, and constructing an audio feature sequence from the per-frame audio features, where an audio feature may be a per-frame Mel-spectrum feature. The audio feature sequence is then input into a text recognition model, which recognizes the language information corresponding to each audio feature in the sequence. According to the audio feature sequence and the language information of each frame's audio feature, a language text library corresponding to the language information is obtained, the text sequence corresponding to the recognized audio feature sequence is retrieved from the language text library, and that text sequence is taken as the original text content of the audio to be recognized.
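The framing step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patent's implementation: a real system would compute a Mel-spectrum feature per frame, whereas here a trivial per-frame energy stands in for the feature, and the frame/hop lengths (25 ms / 10 ms at 16 kHz) are common assumed defaults.

```python
def frame_audio(samples, frame_len=400, hop_len=160):
    """Split a 1-D list of samples into overlapping frames
    (400/160 samples = 25 ms frames with a 10 ms hop at 16 kHz)."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop_len)]

def feature_sequence(samples):
    # Placeholder feature: mean absolute amplitude of each frame.
    # A real system would compute a Mel-spectrum feature here.
    return [sum(abs(x) for x in frame) / len(frame)
            for frame in frame_audio(samples)]

audio = [0.0] * 1600           # 0.1 s of silence at 16 kHz
feats = feature_sequence(audio)
print(len(feats))              # → 8 frames in the feature sequence
```

Each element of `feats` would then be replaced by a Mel-spectrum vector before being fed to the text recognition model.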
Step S12, task guide information corresponding to the original text content is obtained, wherein the task guide information at least comprises: processing tasks and task requirements for processing tasks.
In the embodiment of the disclosure, acquiring task guide information corresponding to original text content includes the following steps A1-A4:
step A1, a task list is obtained, wherein the task list comprises at least one processing task and task requirements corresponding to the processing task.
In the embodiment of the present disclosure, the smart device may acquire a task list input by a user. The task list includes at least one processing task, which may be punctuation recovery, inverse text conversion, text smoothing, text removal, and the like. The task list further includes the task requirement corresponding to each processing task; a task requirement may be, for example, restoring the punctuation of a certain paragraph in the text, or removing a certain type of character from the text.
As one example, the task list includes the following:
Task 1. Punctuation recovery: restore all punctuation in the text.
Task 2. Text removal: remove characters of a specific type from the text, where the specific type may be determined according to the application scenario of the text.
Here, "Task 1" and "Task 2" are the task identifiers, punctuation recovery and text removal are the processing tasks, and restoring all punctuation in the text and removing characters of a specific type from the text are the task requirements.
And step A2, reading the task list to obtain a task identifier and a task requirement corresponding to each processing task.
In the embodiment of the disclosure, the smart device reads the task list to obtain the task identifier and the task requirement of each processing task, and then converts them into a natural-language guidance (prompt) description.
And step A3, determining the task execution sequence of the processing task by using the task identification.
In the embodiment of the present disclosure, the task execution order may be: (0) declare the overall flow of the processing tasks; (1) execute processing task 1; (2) execute processing task 2.
And step A4, generating task guide information based on the task execution sequence, the processing tasks and the task requirements of the processing tasks.
In the disclosed embodiments, the task guide (prompt) information may be:
(0) Declare the overall flow of the processing tasks: processing task 1 -> processing task 2.
(1) Processing task 1: restore all punctuation in the text.
(2) Processing task 2: remove characters of a specific type from the text, where the specific type may be emoticons or the like.
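Steps A1 to A4 can be sketched as follows. The task names, requirement wording, and output format are illustrative assumptions, not the actual prompt format used by the disclosure:

```python
def build_task_guide(task_list):
    """task_list: (task_id, task_name, task_requirement) triples (steps A1/A2)."""
    ordered = sorted(task_list, key=lambda t: t[0])       # step A3: order by identifier
    lines = ["(0) Overall flow: " + " -> ".join(name for _, name, _ in ordered)]
    for i, (_, name, requirement) in enumerate(ordered, start=1):
        lines.append(f"({i}) {name}: {requirement}")      # step A4: render the prompt
    return "\n".join(lines)

tasks = [
    (1, "punctuation recovery", "restore all punctuation in the text"),
    (2, "text removal", "remove emoticon characters from the text"),
]
print(build_task_guide(tasks))
```

Updating the task guide information after a task-list change then amounts to calling `build_task_guide` again on the updated list, with no model retraining.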
According to the method provided by the embodiment of the disclosure, task guide information corresponding to the text content is added so that, when an autoregressive language model subsequently processes the text content, it extracts the corresponding features according to the task requirements in the task guide information and processes them. Using task guide information to direct the autoregressive language model to execute the corresponding processing tasks improves processing efficiency: even when there are multiple processing tasks, they can all be completed by setting the task execution order and inputting the original text content into the autoregressive language model only once. Compared with calling a plurality of sub-models, this greatly improves the processing efficiency of the speech recognition result and greatly reduces the resource consumption of training.
In the embodiment of the present disclosure, whether there is an update operation on the task list may also be monitored. If there is an update operation, the updated processing tasks and their corresponding task requirements are read, and the task guide information is updated based on them.
Step S13, feature information that satisfies the processing task is extracted from the original text content, and the feature information is processed according to the processing operation corresponding to the task requirement to obtain the target text content.
In the embodiment of the disclosure, extracting feature information that satisfies the processing task from the original text content, and processing the feature information according to the processing operation corresponding to the task requirement to obtain the target text content, includes the following steps B1 to B3:
step B1, obtaining a pre-trained autoregressive language model, wherein the autoregressive language model comprises: a feature extraction network and a processing network.
In the embodiment of the disclosure, the autoregressive language model may be one of several deep learning models, such as a recurrent neural network (RNN) or a Transformer. The autoregressive language model is obtained by first training on a large amount of unlabeled data and then training with reinforcement learning from human feedback (RLHF).
Specifically, a text sample is obtained and some of its characters are masked, for example: the punctuation in the text sample is masked, or characters of a certain type in the text sample are masked. The masked characters serve as the labels of the corresponding text sample. The text sample is input into an initial network model, which outputs predicted characters for the masked positions and the confidence of each predicted character. The initial network model is then taken as the network model to be reinforcement-trained. The similarity between each predicted character and the label is input into this network model in descending order of confidence, and the network model determines the predicted text of the text sample from the predicted characters. A reinforcement-learning reward is determined from the predicted text and the labels, and the network model is trained with reward maximization as the optimization objective, obtaining the autoregressive language model.
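The sample-construction idea in the paragraph above (mask certain characters and keep the masked characters as labels) can be sketched as follows. The mask token and the restriction to punctuation are assumptions for illustration; the model training and RLHF stages themselves are out of scope here.

```python
PUNCTUATION = set(",.!?;:")

def make_training_pair(text, mask_token="<m>"):
    """Mask punctuation characters; the masked characters become the labels."""
    masked_chars, labels = [], []
    for ch in text:
        if ch in PUNCTUATION:
            masked_chars.append(mask_token)
            labels.append(ch)
        else:
            masked_chars.append(ch)
    return "".join(masked_chars), labels

sample_input, sample_labels = make_training_pair("hello, world.")
print(sample_input)   # → hello<m> world<m>
print(sample_labels)  # → [',', '.']
```

The same construction works for any maskable character class (e.g. emoticons) by swapping the `PUNCTUATION` set.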
And step B2, carrying the task guide information in the original text content, and inputting the original text content carrying the task guide information into the autoregressive language model.
In the embodiment of the disclosure, the task guide information is carried in the original text content, so that after receiving the original text content the autoregressive language model can process it according to the task guide information.
And B3, determining processing tasks based on the task execution sequence through a feature extraction network, extracting feature information meeting the processing tasks from the original text content, and processing the feature information through the processing network according to processing operations corresponding to task requirements to obtain target text content.
In the embodiment of the disclosure, determining the processing task based on the task execution order through the feature extraction network, extracting feature information that satisfies the processing task from the original text content, and processing the feature information through the processing network according to the processing operation corresponding to the task requirement to obtain the target text content, includes the following steps C1 to C5:
and step C1, determining a first processing task based on the task execution sequence through a feature extraction network, extracting first feature information meeting the first processing task from the original text content, and transmitting the first feature information to the processing network.
In the embodiment of the present disclosure, the feature extraction network determines the first processing task according to the task execution order; the first processing task can be understood as the task that comes first in the task execution order. For example, when the first processing task is punctuation recovery, the first feature information may be phrases, characters, and the like in the original text content; when the first processing task is removal of characters of a specific type, the first feature information may be phrases. After extracting the first feature information, the feature extraction network passes it to the processing network.
And C2, determining a first processing operation corresponding to the task requirement of the first processing task through the processing network, processing the first characteristic information according to the first processing operation to obtain first text content, and transmitting the first text content to the characteristic extraction network.
In the embodiment of the disclosure, the processing network may set the corresponding first processing operation according to the task requirement.
As one example, the task requirement of the first processing task is: restore the punctuation of the first paragraph of the text content. The corresponding first processing operation is: extract the phrases or characters belonging to the first paragraph, and detect the target phrases to which punctuation can be added. Specifically, the processing network extracts the phrases or characters belonging to the first paragraph, obtains, based on the linguistic information between adjacent phrases, the probability of each punctuation type that may need to be added, takes the phrase with the highest probability as the target phrase to which punctuation is to be added, and obtains the first text content after adding the punctuation to the target phrase. The first text content is finally passed to the feature extraction network so that it can continue with the next processing task.
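A minimal sketch of this first processing operation: given per-boundary punctuation probabilities (which in the disclosure would come from the processing network, but are hard-coded here), punctuation is kept at the boundaries whose probability clears a threshold. All names, phrases, and probabilities below are illustrative assumptions.

```python
def restore_punctuation(phrases, boundary_probs, threshold=0.5):
    """boundary_probs[i]: (punctuation, probability) for the boundary after phrases[i].
    Punctuation is inserted only where its probability clears the threshold."""
    pieces = []
    for i, phrase in enumerate(phrases):
        piece = phrase
        if i < len(boundary_probs):
            punct, prob = boundary_probs[i]
            if prob >= threshold:
                piece += punct
        pieces.append(piece)
    return " ".join(pieces)

phrases = ["hello everyone", "welcome to the meeting", "let us begin"]
probs = [(",", 0.9), (".", 0.8), (".", 0.95)]   # made-up model outputs
print(restore_punctuation(phrases, probs))
# → hello everyone, welcome to the meeting. let us begin.
```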
And C3, determining a second processing task based on the task execution sequence through the feature extraction network, extracting second feature information meeting the second processing task from the first text content, and transmitting the second feature information to the processing network.
As one example, the task requirement of the second processing task is: mask the emoticons in the first paragraph. The feature extraction network extracts second feature information from the first text content; the second feature information may be the set of character strings corresponding to the first text content.
Because the first processing task (punctuation recovery) has added punctuation to the text content, fewer character strings need to be extracted for the subsequent emoticon detection (the second processing task), which reduces the workload of the second processing task.
And C4, determining a second processing operation corresponding to the task requirement of a second processing task through the processing network, and processing the second characteristic information according to the second processing operation to obtain second text content.
As one example, the task requirement of the second processing task is: mask the emoticons in the first paragraph. The corresponding second processing operation is: extract the character string set of the first paragraph, detect the target character strings belonging to emoticons, and mask the target character strings. Specifically, the processing network matches each character string in the set passed by the feature extraction network against the preset character strings corresponding to emoticons, takes the successfully matched character strings as target character strings, and finally masks the target character strings in the first text content to obtain the second text content.
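The matching-and-masking operation can be sketched with a preset emoticon table; the emoticon set and mask token below are assumptions for illustration, not values from the disclosure.

```python
EMOTICONS = {":)", ":(", ":D", ";)"}   # assumed preset emoticon table

def mask_emoticons(text, mask_token="[MASK]"):
    """Replace every character string that matches a preset emoticon."""
    # Longer emoticons first, so no shorter pattern clobbers a longer match.
    for emoticon in sorted(EMOTICONS, key=len, reverse=True):
        text = text.replace(emoticon, mask_token)
    return text

print(mask_emoticons("Great talk :) see you tomorrow ;)"))
# → Great talk [MASK] see you tomorrow [MASK]
```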
And step C5, when the second processing task is the last processing task in the task execution sequence, taking the second text content as the target text content.
In the embodiment of the disclosure, whether a third processing task exists in the task execution order is detected. If not, the second processing task is determined to be the last processing task in the task execution order, and the second text content is directly output as the target text content. If a third processing task exists in the task execution order, the above process is repeated: the second text content is passed to the feature extraction network, and the third processing task is executed.
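The overall loop of steps C1 to C5 amounts to a sequential pipeline: each processing task consumes the previous task's output, and the last task's output is the target text content. The stand-in task functions below only illustrate the data flow, not the feature-extraction and processing networks of the disclosure.

```python
def run_pipeline(text, ordered_tasks):
    """Apply each task in execution order; each consumes the previous output."""
    for task in ordered_tasks:
        text = task(text)     # steps C1-C4: task N's output feeds task N+1
    return text               # step C5: the last task's output is the target text

pipeline_tasks = [
    lambda t: t + ".",                # stand-in for punctuation recovery
    lambda t: t.replace(":)", ""),    # stand-in for emoticon removal
]
print(run_pipeline("nice to meet you:)", pipeline_tasks))
# → nice to meet you.
```

Appending a third task is just appending another callable to `pipeline_tasks`; the loop needs no change.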
According to the method provided by the embodiment of the disclosure, the task guide information is constructed by using the processing tasks and their requirement information, and the processing tasks on the voice recognition result are executed according to the task guide information, without relying on a plurality of sub-models to complete the corresponding processing tasks, so that the processing efficiency of the voice recognition result is improved. Even if a processing task changes, only the task guide information needs to be updated, and no sub-model needs to be retrained, which reduces the workload of the preliminary preparation.
Fig. 2 is a flowchart of a method for processing a speech recognition result according to an embodiment of the present disclosure, as shown in fig. 2, the method includes:
Step S21, a voice recognition result obtained by voice recognition of the audio to be recognized is obtained, wherein the voice recognition result comprises original text content corresponding to the audio to be recognized. The detailed description refers to the corresponding related descriptions of the above embodiments, and will not be repeated here.
Step S22, task guide information corresponding to the original text content is obtained, wherein the task guide information at least comprises: processing tasks and task requirements for processing tasks. The detailed description refers to the corresponding related descriptions of the above embodiments, and will not be repeated here.
Step S23, extracting feature information meeting the processing task from the original text content, and processing the feature information according to the processing operation corresponding to the task requirement to obtain the target text content.
In the embodiment of the disclosure, extracting feature information meeting the processing task from the original text content and processing the feature information according to the processing operation corresponding to the task requirement to obtain the target text content includes the following steps D1 to D3:
step D1, obtaining a pre-trained autoregressive language model, wherein the autoregressive language model comprises: a feature extraction network and a processing network. The detailed description refers to the corresponding related descriptions of the above embodiments, and will not be repeated here.
And step D2, carrying the task guide information in the original text content, and inputting the original text content carrying the task guide information into the autoregressive language model.
In the embodiment of the disclosure, carrying the task guide information in the original text content and inputting the original text content carrying the task guide information into the autoregressive language model includes the following steps D201 to D203:
Step D201, obtaining the splitting requirement corresponding to the original text content.
And step D202, splitting the original text content according to splitting requirements to obtain a plurality of text fragments.
And step D203, carrying the task guide information on each text segment of the original text content, and sequentially inputting each text segment carrying the task guide information into the autoregressive language model.
In the disclosed embodiments, the splitting requirement may be determined according to the business requirement, for example: the splitting requirement may be entered when the business requirement is to process only a few segments, or words, of the original text content. The intelligent device splits the original text content according to the splitting requirement to obtain a plurality of text fragments. Meanwhile, since the processing tasks to be executed on each text fragment are the same, the task guide information is carried in each text fragment of the original text content. Finally, each text fragment carrying the task guide information is sequentially input into the autoregressive language model.
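Steps D201 to D203 can be sketched as follows, assuming for illustration that the splitting requirement is simply a maximum fragment length in characters:

```python
def split_text(original_text: str, max_len: int) -> list[str]:
    """Split the original text content into fragments of at most max_len characters."""
    return [original_text[i:i + max_len]
            for i in range(0, len(original_text), max_len)]

def attach_guidance(fragments: list[str], guidance: str) -> list[str]:
    """Carry the same task guide information on every text fragment."""
    return [guidance + "\n" + fragment for fragment in fragments]
```

A real splitting requirement could equally be sentence- or paragraph-based; only the `split_text` helper would change.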
And step D3, determining processing tasks based on the task execution sequence through the feature extraction network, extracting feature information meeting the processing tasks from the original text content, and processing the feature information through the processing network according to processing operations corresponding to task requirements to obtain target text content.
In the embodiment of the disclosure, a processing task is determined based on the task execution sequence through the feature extraction network, first segment features meeting the processing task are extracted from the first text segment, and the first segment features are processed through the processing network according to the processing operation corresponding to the task requirement, so that the first target segment content of the first text segment is obtained. Then a processing task is determined based on the task execution sequence through the feature extraction network, second segment features meeting the processing task are extracted from the second text segment, and the second segment features are processed through the processing network according to the processing operation corresponding to the task requirement to obtain the second target segment content of the second text segment. The above process is repeated until all text segments are processed.
In the embodiment of the disclosure, after all text segments are processed, target segment contents output by an autoregressive language model for each text segment are obtained; and splicing the target segment content corresponding to the text segment according to the splitting sequence of the text segment to obtain target text content.
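The per-fragment processing and splicing described above can be sketched as follows; `model` is a hypothetical callable wrapping the autoregressive language model:

```python
def process_fragments(fragments: list[str], model) -> str:
    """Run each guided fragment through the model in its split order,
    then splice the target segment contents back together."""
    outputs = [model(fragment) for fragment in fragments]
    return "".join(outputs)  # splicing follows the splitting order
```

Because the list comprehension preserves order, the spliced result matches the splitting sequence of the original text content.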
According to the method provided by the embodiment of the disclosure, the task guide information is constructed by using the processing tasks and their requirement information, and the processing tasks on the voice recognition result are executed according to the task guide information, without relying on a plurality of sub-models to complete the corresponding processing tasks, so that the processing efficiency of the voice recognition result is improved. Even if a processing task changes, only the task guide information needs to be updated, and no sub-model needs to be retrained, which reduces the workload of the preliminary preparation.
Fig. 3 is a flowchart of a method for processing a speech recognition result according to an embodiment of the present disclosure, as shown in fig. 3, the method includes:
step S31, whether the target text content meets the preset use condition is detected, and a detection result is obtained.
In the embodiment of the disclosure, the preset use condition represents a standard use effect, and whether a corresponding example sample needs to be added to the task guide information is determined by detecting whether the target text content meets the preset use condition. If the detection result is that the target text content does not meet the preset use condition, the current autoregressive language model needs an example sample to improve its processing precision, so as to further ensure that the target text content meets the preset use condition. If the detection result is that the target text content meets the preset use condition, the current autoregressive language model can be used without an example sample.
Step S32, when the detection result is that the target text content does not meet the preset use condition, acquiring an example sample corresponding to the processing task, and updating the task guide information by using the example sample to obtain updated task guide information.
In the embodiment of the disclosure, the example samples corresponding to different processing tasks are different, for example: when the processing task is to replace the emoticons in the text, an example sample may be a set of emoticon strings such as "[ pulling ]", "[ Crying ]" and "[ audio used ]", where the output corresponding to "[ pulling ]" is "1", the output corresponding to "[ Crying ]" is "2", and the output corresponding to "[ audio used ]" is "3".
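One way such example samples could be folded into the task guide information is sketched below; the plain-text prompt layout, field names, and dict shape are illustrative assumptions:

```python
def build_guidance(task_id: str, requirement: str,
                   examples: list[tuple[str, str]]) -> str:
    """Compose updated task guide information from a task identifier,
    a task requirement, and input/output example samples."""
    lines = [f"Task: {task_id}", f"Requirement: {requirement}"]
    for source, target in examples:
        lines.append(f"Example: {source} -> {target}")  # one example sample
    return "\n".join(lines)
```

The resulting string can then be carried in the original text content exactly as the un-updated guide information was.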
In an embodiment of the present disclosure, the updated task guidance information includes: task identification, processing task requirements corresponding to tasks and example samples.
Step S33, the updated task guidance information and the original text content are input into the autoregressive language model.
In the embodiment of the disclosure, the updated task guide information and the original text content are input into the autoregressive language model, so that the autoregressive language model can process the original text content according to an example sample, and finally output target text content is ensured to meet preset use conditions.
Step S34, determining processing tasks based on the task execution sequence through a feature extraction network, extracting feature information meeting the processing tasks from the original text content according to the example samples, and processing the feature information through the processing network according to the example samples and processing operations corresponding to task requirements to obtain target text content.
As an example, when the processing task is standardized processing of English text (i.e., capitalizing the first letter of the first word of an English sentence, etc.), an example sample may be the input "it is the simple. the first step is to improve our appearance" paired with the output "It is the simple. The first step is to improve our appearance". The feature extraction network in the autoregressive language model extracts the first word of each English sentence from the original text content according to the example sample, obtaining "it" and "the" as the feature words. The feature words are then passed to the processing network, which processes "it" into "It" and "the" into "The" according to the example sample. The final output target text content is "It is the simple. The first step is to improve our appearance".
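For reference, the capitalization behavior of this standardization task can be approximated with a regular expression; this is a plain-Python approximation of the task's effect, not the model-based procedure the patent describes:

```python
import re

def normalize_english(text: str) -> str:
    """Capitalize the first letter at the start of the text and after
    sentence-ending punctuation followed by whitespace."""
    return re.sub(r"(?:^|(?<=[.!?]\s))[a-z]",
                  lambda m: m.group(0).upper(), text)
```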
According to the method provided by the embodiment of the disclosure, when the target text content does not meet the preset use condition, the task guide information can be updated with an example sample of the processing task, so that when the autoregressive language model subsequently processes the original text content, it can extract the corresponding feature information with reference to the example sample and process the feature information with reference to the example sample, thereby improving the accuracy of the output result. There is no need to retrain the model or fine-tune the model parameters, which improves the flexibility and adaptability of the processing of the voice recognition result.
In addition, because the autoregressive language model is already trained and fine-tuned through a large-scale sample, a large amount of fine labeling data is not needed, the time and cost for data preparation can be reduced, and the problems of data scarcity, difference among different languages and the like can be solved.
The embodiment also provides a device for processing a voice recognition result, which is used for implementing the foregoing embodiments and preferred implementations; what has already been described is not repeated here. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
The present embodiment provides a processing apparatus for a speech recognition result, as shown in fig. 4, including:
the obtaining module 41 is configured to obtain a speech recognition result obtained by performing speech recognition on the audio to be recognized, where the speech recognition result includes original text content corresponding to the audio to be recognized;
the processing module 42 is configured to obtain task guidance information corresponding to the original text content, where the task guidance information at least includes: task execution sequence, at least one processing task and task requirement of the processing task;
and the execution module 43 is used for extracting the characteristic information meeting the processing task from the original text content according to the task execution sequence, and processing the characteristic information according to the processing operation corresponding to the task requirement to obtain the target text content.
In the disclosed embodiment, the processing module 42 includes:
the first acquisition unit is used for acquiring a task list, wherein the task list comprises at least one processing task and a task requirement corresponding to the processing task;
the reading unit is used for reading the task list to obtain task identifiers and task demands corresponding to each processing task;
a determining unit for determining a task execution order of the processing tasks by using the task identification;
And the generating unit is used for generating task guide information based on the task execution sequence, the processing tasks and the task requirements of the processing tasks.
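The cooperation of these units can be sketched as follows; representing each task-list entry as a dict with a numeric identifier that determines the execution order is an illustrative assumption:

```python
def make_guidance(task_list: list[dict]) -> dict:
    """Read the task list, determine the task execution order from the
    task identifiers, and generate the task guide information."""
    ordered = sorted(task_list, key=lambda task: task["id"])
    return {
        "order": [task["id"] for task in ordered],              # execution order
        "tasks": [(task["id"], task["requirement"]) for task in ordered],
    }
```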
In the disclosed embodiment, the execution module 43 includes:
a second obtaining unit, configured to obtain a pre-trained autoregressive language model, where the autoregressive language model includes: a feature extraction network and a processing network;
the input unit is used for carrying the task guide information in the original text content and inputting the original text content carrying the task guide information into the autoregressive language model;
the processing unit is used for determining processing tasks based on the task execution sequence through the feature extraction network, extracting feature information meeting the processing tasks from the original text content, and processing the feature information through the processing network according to processing operations corresponding to task requirements to obtain target text content.
In the embodiment of the disclosure, a processing unit is used for determining a first processing task based on a task execution sequence through a feature extraction network, extracting first feature information meeting the first processing task from original text content, and transmitting the first feature information to the processing network; determining a first processing operation corresponding to a task requirement of a first processing task through a processing network, processing the first characteristic information according to the first processing operation to obtain first text content, and transmitting the first text content to a characteristic extraction network; determining a second processing task based on the task execution sequence through a feature extraction network, extracting second feature information which meets the second processing task from the first text content, and transmitting the second feature information to the processing network; determining a second processing operation corresponding to the task requirement of a second processing task through a processing network, and processing the second characteristic information according to the second processing operation to obtain second text content; and when the second processing task is the last processing task in the task execution sequence, taking the second text content as the target text content.
In the embodiment of the disclosure, an input unit is used for acquiring a splitting requirement corresponding to the original text content; splitting the original text content according to the splitting requirement to obtain a plurality of text fragments; and carrying the task guide information on each text segment of the original text content, and sequentially inputting each text segment carrying the task guide information into the autoregressive language model.
In an embodiment of the present disclosure, the apparatus further includes: the splicing module is used for acquiring target fragment contents output by the autoregressive language model for each text fragment; and splicing the target segment content corresponding to the text segment according to the splitting sequence of the text segment to obtain target text content.
In an embodiment of the present disclosure, the apparatus further includes: the updating module is used for detecting whether the target text content meets preset use conditions or not to obtain a detection result; under the condition that the detection result is that the target text content does not meet the preset use condition, acquiring an example sample corresponding to the processing task, and updating the task guide information by using the example sample to obtain updated task guide information; inputting the updated task guide information and the original text content into an autoregressive language model; and determining processing tasks based on the task execution sequence through a feature extraction network, extracting feature information meeting the processing tasks from the original text content according to the example samples, and processing the feature information through the processing network according to the example samples and processing operations corresponding to task requirements to obtain target text content.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a computer device according to an alternative embodiment of the disclosure, as shown in fig. 5, the computer device includes: one or more processors 10, memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the computer device, including instructions stored in or on memory to display graphical information of the GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple computer devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is illustrated in fig. 5.
The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip, among others. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, a general-purpose array logic, or any combination thereof.
Wherein the memory 20 stores instructions executable by the at least one processor 10 to cause the at least one processor 10 to perform the methods shown in implementing the above embodiments.
The memory 20 may include a storage program area and a storage data area; the storage program area may store an operating system and at least one application program required for a function, and the storage data area may store data created according to the use of the computer device, and the like. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, memory 20 may optionally include memory located remotely from processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.
The computer device also includes a communication interface 30 for the computer device to communicate with other devices or communication networks.
The presently disclosed embodiments also provide a computer readable storage medium. The methods described above according to the presently disclosed embodiments may be implemented in hardware or firmware, or as software or computer code storable in a recordable storage medium, or as computer code originally stored in a remote storage medium or a non-transitory machine-readable storage medium and downloaded over a network to be stored in a local storage medium, so that the methods described herein can be processed by software stored on a storage medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware. The storage medium can be a magnetic disk, an optical disk, a read-only memory, a random access memory, a flash memory, a hard disk, a solid state disk or the like; further, the storage medium may also comprise a combination of memories of the kind described above. It will be appreciated that a computer, processor, microprocessor controller or programmable hardware includes a storage element that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the methods illustrated by the above embodiments.
Although embodiments of the present disclosure have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the disclosure, and such modifications and variations are within the scope defined by the appended claims.

Claims (10)

1. A method for processing a speech recognition result, the method comprising:
obtaining a voice recognition result obtained by voice recognition of audio to be recognized, wherein the voice recognition result comprises original text content corresponding to the audio to be recognized;
acquiring task guide information corresponding to the original text content, wherein the task guide information at least comprises: processing a task and a task requirement of the processing task;
and extracting characteristic information meeting the processing task from the original text content, and processing the characteristic information according to processing operation corresponding to the task requirement to obtain target text content.
2. The method according to claim 1, wherein the acquiring the task guidance information corresponding to the original text content includes:
acquiring a task list, wherein the task list comprises at least one processing task and a task requirement corresponding to the processing task;
reading the task list to obtain a task identifier and a task requirement corresponding to each processing task;
determining a task execution sequence of the processing task by utilizing the task identification;
and generating the task guide information based on the task execution sequence, the processing task and the task requirement of the processing task.
3. The method according to claim 2, wherein the extracting feature information satisfying the processing task from the original text content, and processing the feature information according to the processing operation corresponding to the task requirement, to obtain the target text content, includes:
obtaining a pre-trained autoregressive language model, wherein the autoregressive language model comprises: a feature extraction network and a processing network;
carrying the task guide information in the original text content, and inputting the original text content carrying the task guide information into the autoregressive language model;
determining processing tasks based on the task execution sequence through the feature extraction network, extracting feature information meeting the processing tasks from the original text content, and processing the feature information through the processing network according to processing operations corresponding to task requirements to obtain target text content.
4. A method according to claim 3, wherein determining, by the feature extraction network, a processing task based on the task execution order, extracting feature information satisfying the processing task from the original text content, and processing, by the processing network, the feature information according to a processing operation corresponding to the task requirement, to obtain a target text content, includes:
determining a first processing task based on the task execution sequence through the feature extraction network, extracting first feature information which meets the first processing task from the original text content, and transmitting the first feature information to the processing network;
determining a first processing operation corresponding to a task requirement of the first processing task through the processing network, processing the first characteristic information according to the first processing operation to obtain first text content, and transmitting the first text content to the characteristic extraction network;
determining a second processing task based on the task execution sequence through the feature extraction network, extracting second feature information which meets the second processing task from the first text content, and transmitting the second feature information to the processing network;
determining a second processing operation corresponding to the task requirement of the second processing task through the processing network, and processing the second characteristic information according to the second processing operation to obtain second text content;
and when the second processing task is the last processing task in the task execution sequence, taking the second text content as the target text content.
5. The method of claim 3, wherein said carrying the task guidance information on the original text content and inputting the original text content carrying the task guidance information into the autoregressive language model comprises:
obtaining splitting requirements corresponding to the original text content;
splitting the original text content according to the splitting requirement to obtain a plurality of text fragments;
and carrying the task guide information on each text segment of the original text content, and sequentially inputting each text segment carrying the task guide information into the autoregressive language model.
6. The method of claim 5, wherein the method further comprises:
obtaining target segment content output by the autoregressive language model for each text segment;
And splicing the target segment content corresponding to the text segment according to the splitting sequence of the text segment to obtain the target text content.
7. A method according to claim 3, characterized in that the method further comprises:
detecting whether the target text content meets preset use conditions or not to obtain a detection result;
acquiring an example sample corresponding to the processing task when the detection result is that the target text content does not meet the preset use condition, and updating the task guide information by using the example sample to obtain updated task guide information;
inputting the updated task guide information and the original text content into the autoregressive language model;
and determining processing tasks based on the task execution sequence through the feature extraction network, extracting feature information meeting the processing tasks from the original text content according to the example samples, and processing the feature information through the processing network according to the example samples and processing operations corresponding to task requirements to obtain target text content.
8. A device for processing speech recognition results, the device comprising:
an acquisition module, used for acquiring a voice recognition result obtained by performing voice recognition on audio to be recognized, wherein the voice recognition result comprises original text content corresponding to the audio to be recognized;
the processing module is used for acquiring task guide information corresponding to the original text content, wherein the task guide information at least comprises: task execution sequence, at least one processing task and task requirements of the processing task;
and the execution module is used for extracting the characteristic information meeting the processing task from the original text content according to the task execution sequence, and processing the characteristic information according to the processing operation corresponding to the task requirement to obtain the target text content.
9. A computer device, comprising:
a memory and a processor in communication with each other, the memory having stored therein computer instructions which, upon execution, cause the processor to perform the method of any of claims 1 to 7.
10. A computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 7.
CN202310943642.4A 2023-07-28 2023-07-28 Processing method and device of voice recognition result, computer equipment and storage medium Pending CN116976311A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310943642.4A CN116976311A (en) 2023-07-28 2023-07-28 Processing method and device of voice recognition result, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN116976311A true CN116976311A (en) 2023-10-31

Family

ID=88472604



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination