CN116434753B - Text smoothing method, device and storage medium - Google Patents

Publication number
CN116434753B
CN116434753B CN202310682675.8A CN202310682675A CN116434753B CN 116434753 B CN116434753 B CN 116434753B CN 202310682675 A CN202310682675 A CN 202310682675A CN 116434753 B CN116434753 B CN 116434753B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310682675.8A
Other languages
Chinese (zh)
Other versions
CN116434753A (en)
Inventor
徐成国
崔和涛
张云柯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Application filed by Honor Device Co Ltd
Priority to CN202310682675.8A
Publication of CN116434753A
Application granted
Publication of CN116434753B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a text smoothing method, a device and a storage medium. A text to be processed is obtained by performing speech recognition on audio. The text to be processed is input into a text smoothing model for label prediction to obtain a first smoothing result, wherein the training set of the text smoothing model is obtained by adding noise to sample data according to an iteratively updated noise-adding rule, the iteratively updated noise-adding rule is obtained by performing deviation analysis on predicted labels and real labels, and the predicted labels are obtained by the text smoothing model performing label prediction on a test set during training. According to the confusion degree (perplexity) of the original text to be processed and the confusion degree of the text corresponding to the first smoothing result, the text with the lower confusion degree is taken as the smooth text. Because the sample data is noised according to the iteratively updated noise-adding rule, rule coverage is broadened, the training data is increased, the performance of the text smoothing model is improved, and the text smoothing effect is improved.

Description

Text smoothing method, device and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a text smoothing method, apparatus, and storage medium.
Background
Automatic speech recognition (ASR) is a technique for converting speech into text so that a user can quickly browse and understand recorded information. However, text converted from user-input audio is usually highly colloquial and contains a large number of filler words and repeated words, which degrade the readability of the transcribed text.
Existing approaches delete useless words according to a series of smoothing standards and rules, such as word segmentation and regularization, to obtain a smooth text. Because these approaches delete useless words based on fixed, pre-established rules whose coverage is limited and which lack generalization, the text smoothing effect is limited.
Disclosure of Invention
The application provides a text smoothing method, a device and a storage medium, which aim to solve the problem that the text smoothing effect is limited during speech-to-text conversion.
In order to achieve the above purpose, the application adopts the following technical scheme:
First aspect: an embodiment of the application provides a text smoothing method, including: performing speech recognition on audio to obtain a text to be processed; inputting the text to be processed into a text smoothing model for label prediction to obtain a first smoothing result, wherein the training set of the text smoothing model is obtained by adding noise to sample data according to an iteratively updated noise-adding rule, the iteratively updated noise-adding rule is obtained by performing deviation analysis on predicted labels and real labels, and the predicted labels are obtained by the text smoothing model performing label prediction on a test set during training; and, according to the original-text confusion degree (perplexity) corresponding to the text to be processed and the text confusion degree corresponding to the first smoothing result, taking the text with the lower confusion degree as the smooth text.
In this method, label prediction is performed on the text to be processed by the text smoothing model. While the text smoothing model is obtained through multiple rounds of iterative training, the idea of active learning is used to iteratively update the noise-adding rule according to the deviation between the predicted labels and the real labels. Adding noise to the sample data according to the iteratively updated noise-adding rule broadens rule coverage, enriches the data, and increases the training data, which improves the performance of the text smoothing model and, in turn, the smoothing effect on speech-recognized text.
In one possible implementation, after the text to be processed is input into the text smoothing model for label prediction and the first smoothing result is obtained, the method further includes: performing word segmentation on the text to be processed to obtain a segmented text; smoothing the segmented text according to the rules in a rule engine for deleting useless characters in text, to obtain a second smoothing result; and combining the first smoothing result with the second smoothing result to obtain a third smoothing result. In this case, taking the text with the lower confusion degree as the smooth text according to the original-text confusion degree corresponding to the text to be processed and the text confusion degree corresponding to the first smoothing result includes: taking the text with the lower confusion degree as the smooth text according to the original-text confusion degree corresponding to the text to be processed and the text confusion degree corresponding to the third smoothing result. Because the segmented text is smoothed according to the rules for deleting useless characters and the resulting second smoothing result is combined with the first smoothing result, the text smoothing effect is further improved.
In one possible implementation, inputting the text to be processed into the text smoothing model for label prediction to obtain the first smoothing result includes: splitting the text to be processed into clauses to obtain clause texts; and inputting the clause texts into the text smoothing model for label prediction to obtain the first smoothing result.
In one possible implementation, before the text with the lower confusion degree is taken as the smooth text according to the original-text confusion degree corresponding to the text to be processed and the text confusion degree corresponding to the first smoothing result, the method further includes: performing word segmentation on the clause text to obtain a segmented text corresponding to the clause text; smoothing the segmented text corresponding to the clause text according to the rules in the rule engine for deleting useless characters in text, to obtain a second smoothing result corresponding to the clause text; and combining the second smoothing result corresponding to the clause text with the first smoothing result to obtain a third smoothing result corresponding to the clause text. In this case, taking the text with the lower confusion degree as the smooth text includes: taking the text with the lower confusion degree as the smooth text according to the original-text confusion degree corresponding to the text to be processed and the text confusion degree corresponding to the third smoothing result of the clause text.
An overlong text to be processed may reduce the efficiency and effect of text smoothing. Therefore, in this embodiment, after the text to be processed is obtained, it is split into several clause texts, and each clause text is smoothed separately, which further improves the text smoothing effect.
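The clause-splitting step can be sketched as follows. This is a minimal illustration only: the function name `split_clauses` and the choice of sentence-ending punctuation as the split points are assumptions, since the application does not specify how clauses are delimited.

```python
import re


def split_clauses(text: str) -> list[str]:
    """Split text into clause texts after sentence-ending punctuation.

    Assumption: clauses end at Chinese or Western '!' / '?' marks or at the
    Chinese full stop; the punctuation stays attached to its clause.
    """
    parts = re.split(r"(?<=[。！？!?])\s*", text)
    # Drop the empty string produced when the text ends with punctuation.
    return [p for p in parts if p]


clauses = split_clauses("Then uh it works! Is it good? Yes")
print(clauses)
```

Each returned clause text would then be passed to the model and the rule engine independently.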
In one possible implementation, the rules in the rule engine for deleting useless characters in text include deleting repeated words in the segmented text, deleting filler words in the segmented text, and restoring special words. Smoothing the segmented text according to these rules to obtain the second smoothing result includes: deleting the repeated words and/or filler words in the segmented text according to the rules for deleting repeated words and deleting filler words, to obtain the second smoothing result. Combining the first smoothing result with the second smoothing result to obtain the third smoothing result includes: combining the first smoothing result with the second smoothing result to obtain a sub-smoothing result; and restoring the special words in the sub-smoothing result according to the rule for restoring special words, to obtain the third smoothing result.
In one possible implementation, the training of the text smoothing model includes: adding noise to sample data to obtain a training set and a test set; extracting features from the noisy text in the training set to obtain a feature matrix corresponding to the noisy text; decoding and predicting on the feature matrix to obtain output labels; iteratively training an initial text smoothing model according to the weighted cross-entropy loss between the output labels and the real labels, to obtain an intermediate version of the text smoothing model; inputting the test set into the intermediate version of the text smoothing model to obtain predicted labels; iteratively updating a preset noise-adding rule based on a first deviation analysis result, obtained by performing deviation analysis on the predicted labels and the real labels, to obtain an iteratively updated noise-adding rule; adding noise to the sample data according to the iteratively updated noise-adding rule to obtain an updated training set; and iteratively training the intermediate version of the text smoothing model on the updated training set to obtain the text smoothing model. Because the sample data is noised according to the iteratively updated noise-adding rule, rule coverage is broadened, the data is enriched, and the training data is increased, which improves the performance of the text smoothing model and the smoothing effect on speech-recognized text.
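The weighted cross-entropy loss mentioned above can be illustrated with a small pure-Python sketch. The class weights, the normalization by the sum of weights, and the function name are assumptions made for illustration; the text here does not state how the weights are chosen.

```python
import math


def weighted_cross_entropy(probs, labels, class_weights):
    """Weighted cross-entropy over a label sequence.

    probs: one dict per token mapping each label ("O" or "D") to its
           predicted probability.
    labels: the real label of each token.
    class_weights: hypothetical per-label weights; giving "D" a larger
           weight counteracts the sparsity of deleted tokens.
    """
    total = 0.0
    weight_sum = 0.0
    for p, y in zip(probs, labels):
        w = class_weights[y]
        total += -w * math.log(p[y])
        weight_sum += w
    # Normalize by the total weight so the loss is a weighted mean.
    return total / weight_sum


probs = [{"O": 0.9, "D": 0.1}, {"O": 0.2, "D": 0.8}]
loss = weighted_cross_entropy(probs, ["O", "D"], {"O": 1.0, "D": 3.0})
print(round(loss, 4))
```

With a larger weight on "D", an error on a rare deleted token contributes more to the loss than an error on a kept token.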
In one possible implementation, iteratively training the intermediate version of the text smoothing model on the updated training set to obtain the text smoothing model includes: iteratively training the intermediate version of the text smoothing model on the updated training set to obtain a trained text smoothing model; combining the test set with a scene test set to obtain a fused-scene test set, and combining the sample data with a scene training set to obtain fused-scene sample data; inputting the fused-scene test set into the trained text smoothing model to obtain test labels; expanding the iteratively updated noise-adding rule based on a second deviation analysis result, obtained by performing deviation analysis on the test labels and the real labels, to obtain an expanded noise-adding rule; adding noise to the fused-scene sample data according to the expanded noise-adding rule to obtain a fused-scene training set; and training the trained text smoothing model on the fused-scene training set to obtain the text smoothing model. Because the trained text smoothing model is further trained on both the sample data and the scene data, domain-transfer fine-tuning is achieved.
In one possible implementation, before the test set is combined with the scene test set to obtain the fused-scene test set and the sample data is combined with the scene training set to obtain the fused-scene sample data, the method further includes: performing speech recognition on scene data to obtain a scene text corresponding to the scene data; and labeling the scene text to obtain the scene test set and the scene training set.
Second aspect: an embodiment of the application provides an electronic device including a processor and a memory. The memory is configured to store program code and transmit the program code to the processor; the processor is configured to perform, according to instructions in the program code, the steps of the text smoothing method described in the first aspect above.
Third aspect: an embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of a text smoothing method as described in the first aspect above.
Drawings
FIG. 1 is a schematic diagram of a speech-to-text scene according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a text label according to an embodiment of the present application;
FIG. 3 is a flowchart of a text smoothing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a word segmentation process for obtaining a word segmentation text according to an embodiment of the present application;
FIG. 5 is a flowchart of a text smoothing model training process for an intermediate version provided by an embodiment of the present application;
FIG. 6 is a flowchart of a text smoothing model training process provided by an embodiment of the present application;
FIG. 7 is a flowchart of another text smoothing method according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 9 is a software structure block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The terms first, second, third and the like in the description and in the claims and in the drawings are used for distinguishing between different objects and not for limiting the specified order.
In embodiments of the application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g." in an embodiment should not be taken as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
The text smoothing method aims to identify and delete spoken-language phenomena, such as repetition, pauses, corrections and redundancy, in the text to be processed that is obtained by performing speech recognition on audio, so that the colloquial text becomes more formal and standardized, its readability and understandability improve, and the user experience improves.
At present, text smoothing methods fall mainly into two categories: supervised text smoothing methods, and unsupervised or self-supervised text smoothing methods. Both are briefly described below.
First category: supervised text smoothing methods, which mainly include pure rule-based text smoothing methods and deep-model-based text smoothing methods.
A pure rule-based text smoothing method formulates a series of smoothing standards and deletes useless words according to rules such as word segmentation and regularization to obtain a smooth text. However, the coverage of the rules is limited and they lack generalization; the rules are also difficult to formulate, because a large amount of corpus analysis is needed to determine their scope, which affects the text smoothing effect.
A text smoothing method based on a supervised deep model requires a large amount of labeled corpus and casts smoothing as a sequence labeling task, completing label prediction and deleting useless words. The labeled corpus is difficult to obtain and the labeling workload is large. In addition, because the useless words to be deleted are less frequent than normal words, the training data tends to be sparsely distributed, which reduces the text smoothing effect.
Second category: unsupervised or self-supervised text smoothing methods, which mainly include text smoothing methods based on an unsupervised or self-supervised deep model. Unlike supervised methods, self-supervised methods learn from the structure of the data itself and do not need a large amount of labeled data to train the model; they typically rely on corpus generalization and perform unsupervised or self-supervised sequence labeling to achieve text smoothing. However, corpus generalization and generation require complex algorithm design to match the data distribution of complex scenes, and the quality of the generated corpus is uncontrollable, which affects the text smoothing effect.
In view of this, to make text converted from audio smoother, improve its readability, and improve the user experience, the application inputs the speech-recognized text to be processed into a text smoothing model to obtain a first smoothing result, and then, according to the confusion degree (perplexity) of the original text to be processed and the confusion degree of the text corresponding to the first smoothing result, takes the text with the lower confusion degree as the smooth text. The text smoothing model is initially trained on sample data. Its training set is obtained by adding noise to the sample data according to an iteratively updated noise-adding rule, the iteratively updated noise-adding rule is obtained by performing deviation analysis on predicted labels and real labels, and the predicted labels are obtained by the text smoothing model performing label prediction on a test set during training. Because the idea of active learning is used during the multiple rounds of iterative training, the noise-adding rule is iteratively updated according to the deviation between the predicted labels and the real labels. Adding noise to the sample data according to the iteratively updated noise-adding rule broadens rule coverage, enriches the data, and increases the training data, which improves the performance of the text smoothing model and, in turn, the smoothing effect on speech-recognized text.
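The final selection step, keeping whichever text has the lower confusion degree, can be sketched as below. The tiny add-one-smoothed bigram language model, the toy corpus, and all function names are assumptions for illustration; the application does not say which language model computes the confusion degree.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Collect bigram counts from tokenized sentences (toy language model)."""
    context, bigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        toks = ["<s>"] + sent
        vocab.update(toks)
        for a, b in zip(toks, toks[1:]):
            context[a] += 1
            bigrams[(a, b)] += 1
    return context, bigrams, len(vocab) + 1  # +1 slot for unknown words


def perplexity(tokens, model):
    """exp of the average negative log-probability, add-one smoothed."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    context, bigrams, v = model
    toks = ["<s>"] + tokens
    log_sum = 0.0
    for a, b in zip(toks, toks[1:]):
        log_sum += math.log((bigrams[(a, b)] + 1) / (context[a] + v))
    return math.exp(-log_sum / len(tokens))


def pick_lower_perplexity(original, smoothed, model):
    """Return the candidate with the lower perplexity (original wins ties)."""
    return min((original, smoothed), key=lambda t: perplexity(t, model))


model = train_bigram([["the", "phone", "is", "good"], ["the", "price", "is", "high"]])
original = ["the", "uh", "the", "the", "phone", "is", "good"]
smoothed = ["the", "phone", "is", "good"]
print(pick_lower_perplexity(original, smoothed, model))
```

Comparing against the original text acts as a safeguard: if smoothing made the text less fluent under the language model, the original is kept.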
The method provided by the application can be applied to various electronic devices that support speech-to-text conversion, including but not limited to mobile phones, tablet computers, desktop computers, laptops, notebook computers, ultra-mobile personal computers (UMPC), handheld computers, netbooks, personal digital assistants (PDA), wearable electronic devices and the like. The application does not limit the specific form of the electronic device.
The following describes an application scenario of the text smoothing method provided by the present application with reference to fig. 1 and fig. 2, taking a voice-text conversion function of a note application in a mobile phone as an example.
FIG. 1 is a schematic diagram of a speech-to-text conversion scenario according to an embodiment of the present application. The user taps the note application on the phone's user interface to open it, then long-presses the microphone key while speaking. As shown in the figure, the user says, in substance, "then the cost performance of the mobile phone is very high", with a filler word and a repeated word in the speech. After the phone performs speech recognition on the speech, it obtains a non-smooth text containing the filler word and the repeated word, smooths this text using the text smoothing method provided by the application, and outputs the resulting smooth text. Once the user finishes speaking, the smoothed text, "then the cost performance of the mobile phone is very high", is displayed on the user interface of the note application.
FIG. 2 is a schematic diagram of text labels according to an embodiment of the present application. The non-smooth text directly converted from the user's audio is the utterance from FIG. 1, which contains the filler word "o" and the repeated word "this". The labels "O" and "D" in the figure indicate whether each character is to be kept or is useless: a character labeled "O" is to be kept, and a character labeled "D" is a useless character that needs to be deleted. In general, filler words and repeated words are useless words: a filler word is deleted outright, and a repeated word is deleted so that only one occurrence remains. In the figure, "o" is a filler word and needs to be deleted; "this" is a repeated word, so one "this" needs to be deleted. Smoothing the non-smooth text with the method provided by the application yields the smooth text "then the cost performance of the mobile phone is very high", which improves readability, reduces manual editing by the user, and improves the user experience.
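Applying the "O"/"D" labels amounts to keeping only the units labeled "O". The sketch below works on English word tokens for readability, whereas the figure labels individual characters; the function name and the example tokens are illustrative assumptions.

```python
def apply_labels(tokens, labels):
    """Keep tokens labeled "O" and drop tokens labeled "D"."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return [tok for tok, lab in zip(tokens, labels) if lab == "O"]


# Illustrative tokens: one filler word ("o") and one repeated word ("this").
tokens = ["then", "o", "this", "this", "phone", "is", "good"]
labels = ["O", "D", "D", "O", "O", "O", "O"]
print(apply_labels(tokens, labels))
```

Which copy of a repeated word receives "D" is a labeling convention; here the earlier copy is dropped and the later one kept.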
Embodiment one:
FIG. 3 is a flowchart of the text smoothing method provided by an embodiment of the present application. The method includes the following steps.
S301, the electronic equipment receives audio input by a user.
S302, after receiving the audio input by the user, the electronic equipment performs voice recognition on the audio based on an automatic voice recognition (ASR) technology to obtain a text to be processed.
The text to be processed is text converted directly from audio. Because of the user's speaking habits, it is usually rather colloquial and may contain words that reduce the smoothness of the text, such as filler words and repeated words.
S303, the electronic device inputs the text to be processed into the text smoothing model for label prediction to obtain a first smoothing result.
S304, while the text to be processed is input into the text smoothing model for label prediction, the electronic device also performs word segmentation on the text to be processed to obtain a segmented text.
For example, FIG. 4 is a schematic diagram of performing word segmentation to obtain a segmented text according to an embodiment of the present application. As shown in the figure, word segmentation is performed on the text to be processed from FIG. 1 to obtain the segmented text, in which the individual words are separated, such as "then", "o", "just", "this", "cell phone", "cost performance" and "very high".
S305, after the word segmentation text is obtained, the electronic equipment performs smoothing processing on the word segmentation text based on rules of a rule engine to obtain a second smoothing result.
A rule engine, which evolved from the inference engine, is a component embedded in an application. It separates business decisions from application code and expresses them using predefined semantic modules; it accepts data input, interprets business rules, and makes business decisions according to those rules. Introducing a rule engine in the embodiment of the application simplifies the system architecture to some extent, optimizes the application, and improves the maintainability of the system. In addition, the rules of the rule engine can be changed flexibly without rewriting existing code, which reduces the cost and risk of hard-coding business rules. Here, the rules of the rule engine are specifically rules for deleting useless characters in text.
By way of example, rules in the rules engine for deleting useless characters in text may include:
1. Delete repeated words in the segmented text: when the segmented text contains consecutive repeated words, only one of them is kept. For example, "this" in FIG. 4 is a consecutive repeated word, so the text "this this" becomes "this" after the repeated word is deleted.
2. Delete filler words in the segmented text: a broad filler-word blacklist is preset, and any word in the segmented text that appears in the blacklist is deleted. For example, if the blacklist includes filler words such as "o", "he" and "singult", the word "o" is deleted when it is detected in the segmented text in FIG. 4.
3. Restore special words: a special-word whitelist is set, and the words in the whitelist are filtered out and marked as not to be deleted. For example, in the name "Nurhaci" (努尔哈赤), the character "ha" (哈) might be judged a useless character and deleted; placing "Nurhaci" in the special-word whitelist ensures that the "ha" inside it is not deleted.
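Rules 1 and 2 can be sketched as a function that marks which segmented words the rule engine would delete. The function name, the blacklist contents, and the choice to delete the later copy of a repeated word are assumptions for illustration, not the rule engine's actual interface.

```python
def rule_delete_mask(tokens, filler_blacklist):
    """Return one boolean per token: True means the rules delete it.

    Rule 1: a token equal to the token immediately before it is a
            consecutive repeat, so only the first copy is kept.
    Rule 2: a token found in the filler-word blacklist is deleted.
    """
    mask = []
    for i, tok in enumerate(tokens):
        is_filler = tok in filler_blacklist
        is_repeat = i > 0 and tok == tokens[i - 1]
        mask.append(is_filler or is_repeat)
    return mask


# Hypothetical segmented text and blacklist.
tokens = ["then", "o", "just", "this", "this", "phone"]
mask = rule_delete_mask(tokens, {"o", "he"})
kept = [t for t, d in zip(tokens, mask) if not d]
print(kept)
```

Returning a mask rather than the filtered text keeps the rule output aligned with the model's per-token labels, which makes a later merge straightforward.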
S306, after the first smooth result and the second smooth result are obtained, the first smooth result and the second smooth result can be combined, and a third smooth result is obtained.
For example, with the three rules above, deleting the repeated words in the segmented text yields a first sub-text, deleting the blacklisted filler words in the segmented text yields a second sub-text, and the first smoothing result obtained from the text smoothing model serves as a third sub-text. Combining the first, second and third sub-texts yields a sub-smoothing result; this amounts to checking, on top of the smoothing result output by the model, whether any repeated words and/or blacklisted filler words remain undeleted. To further improve smoothing accuracy, the special words in the sub-smoothing result are then restored based on the special-word whitelist to obtain the third smoothing result. In this way, special words in the whitelist are protected from being deleted by mistake, and the text smoothing effect is improved.
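One way to read the combination step is as a union of deletions followed by whitelist restoration. The sketch below takes that reading as an explicit assumption, since the text only says the results are "combined"; the function name and the token-level whitelist format are also hypothetical.

```python
def combine_and_restore(tokens, model_delete, rule_delete, whitelist):
    """Merge model and rule deletions, then protect whitelisted phrases.

    model_delete / rule_delete: one boolean per token (True = delete).
    whitelist: token tuples for special words that must never lose a token.
    Assumption: "combining" means a token is deleted if either source
    marks it for deletion.
    """
    delete = [m or r for m, r in zip(model_delete, rule_delete)]
    for phrase in whitelist:
        n = len(phrase)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == tuple(phrase):
                # Restore every token inside the special word.
                delete[i:i + n] = [False] * n
    return [t for t, d in zip(tokens, delete) if not d]


tokens = ["nuer", "ha", "chi", "o", "is", "is", "famous"]
model_delete = [False, True, False, True, False, False, False]
rule_delete = [False, True, False, True, False, True, False]
result = combine_and_restore(tokens, model_delete, rule_delete, [("nuer", "ha", "chi")])
print(result)
```

In this example the rules catch a repeated "is" that the model missed, while the whitelist keeps "ha" inside the special word even though both sources marked it for deletion.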
The text smoothing model is obtained through multiple rounds of iterative training. During this process, the embodiment of the application uses the idea of active learning to iteratively update the noise-adding rule. Adding noise to the sample data according to the iteratively updated noise-adding rule enriches the training data, improves the performance of the model, and further improves the smoothing effect on speech-recognized text.
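The text does not spell out what the deviation analysis computes. As one hypothetical interpretation, the sketch below counts words that the model kept ("O") although their real label is "D", and adds the frequent ones to the filler list used by the noise-adding rule. All names and the `min_count` threshold are assumptions.

```python
from collections import Counter


def missed_deletions(tokens, predicted, real):
    """Count tokens predicted "O" whose real label is "D"."""
    return Counter(
        tok for tok, p, r in zip(tokens, predicted, real) if p == "O" and r == "D"
    )


def update_filler_list(fillers, test_set, predictions, min_count=2):
    """Extend the noise rule's filler list with frequently missed words.

    test_set: list of (tokens, real_labels) pairs.
    predictions: predicted label sequences, aligned with test_set.
    """
    missed = Counter()
    for (tokens, real), pred in zip(test_set, predictions):
        missed += missed_deletions(tokens, pred, real)
    new = [t for t, c in missed.items() if c >= min_count and t not in fillers]
    return fillers + sorted(new)


test_set = [
    (["well", "I", "agree"], ["D", "O", "O"]),
    (["well", "it", "works"], ["D", "O", "O"]),
    (["uh", "yes"], ["D", "O"]),
]
predictions = [["O", "O", "O"], ["O", "O", "O"], ["D", "O"]]
print(update_filler_list(["uh"], test_set, predictions))
```

The next round of noise adding would then insert the newly found word into clean sample text, so the model sees more examples of the pattern it currently misses.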
As shown in FIG. 5, the figure is a flow chart of a text smoothing model training process for an intermediate version provided by an embodiment of the present application.
First, initial training is performed based on sample data to obtain a text smoothing model of the intermediate version. The sample data may be open source data, i.e., data legally collected from public, openly available channels. In short, open source data is data that anyone can access, modify, reuse, and share. Open source data is characterized by massive volume and fragmentation, and its sheer quantity provides a large number of data sources for training the text smoothing model.
The open source data will be described below as an example. In order to improve the text smoothing effect, the open source data used for training needs to be smooth text. For example, since news class data is typically more canonical, news class data may be employed for initial training to obtain a middle version of the text smoothing model.
Specifically, based on a preset noise adding rule, noise adding processing is performed on the sample data to obtain a test set and a training set. The sample data and the preset noise adding rule are input into a loader to generate supervision data. The supervision data comprises the test set and the training set, both of which contain noisy text, i.e., the sample data after noise adding. The test set is fixed after it is first generated. For the initial training process, some noise adding rules can be set in advance.
Illustratively, the preset noise adding rule may include at least the following five types:
1. After word segmentation is performed on the sample data, an index position is randomly selected, a Chinese word is randomly selected from the Chinese word list, and the word is inserted at the current index position.
2. After word segmentation is performed on the sample data, an index position is randomly selected, the phrase at the current index position is selected, randomly repeated 1 to 3 times, and inserted at the current index position. For example, if the text is "I feel good", the index position corresponding to "feel" is selected, and "feel" is randomly repeated twice, the noisy text "I feel feel feel good" is obtained.
3. After word segmentation is performed on the sample data, a first index position is randomly selected and the phrase at that position is selected; a second index position is then randomly selected, and the phrase corresponding to the first index is inserted at the second index position.
4. Without word segmentation, an index position is randomly selected, a Chinese word is randomly selected from the Chinese word list, and the word is inserted at the current index position.
5. After word segmentation, an index position is randomly selected, the phrase at the current index position is selected, randomly repeated 1 to 3 times, and inserted at the current index position. For example, if the text is "I feel good", the index position corresponding to "feel" is selected, and "feel" is randomly repeated twice, the noisy text "I feel feel feel good" is obtained.
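Two of the preset rules above (random word insertion and random phrase repetition) might be sketched as below. This is only an illustration under stated assumptions: the text is already segmented into a token list, the word list `FILLERS` is a hypothetical stand-in for the word list mentioned in the rules, and labels are emitted per token rather than per character.

```python
import random
from typing import List, Tuple

FILLERS = ["uh", "um", "well"]  # hypothetical word list, not taken from the source


def insert_filler(tokens: List[str], rng: random.Random) -> Tuple[List[str], List[str]]:
    """Insert one randomly chosen word at a random index; label it 'D'."""
    idx = rng.randrange(len(tokens) + 1)
    noisy = tokens[:idx] + [rng.choice(FILLERS)] + tokens[idx:]
    labels = ["O"] * idx + ["D"] + ["O"] * (len(tokens) - idx)
    return noisy, labels


def repeat_phrase(tokens: List[str], rng: random.Random) -> Tuple[List[str], List[str]]:
    """Repeat the phrase at a random index 1 to 3 times; label the copies 'D'."""
    idx = rng.randrange(len(tokens))
    times = rng.randint(1, 3)
    noisy = tokens[:idx] + [tokens[idx]] * times + tokens[idx:]
    labels = ["O"] * idx + ["D"] * times + ["O"] * (len(tokens) - idx)
    return noisy, labels


def strip_noise(noisy: List[str], labels: List[str]) -> List[str]:
    """Drop every token labeled 'D'; this recovers the clean sample."""
    return [tok for tok, lab in zip(noisy, labels) if lab == "O"]


rng = random.Random(0)
noisy, labels = repeat_phrase(["I", "feel", "good"], rng)
print(noisy, labels)
```

Because each rule produces the labels together with the noisy text, the supervision data needs no manual annotation.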
After the training set is obtained, Token-level sequence labeling is performed on the noisy text in the training set; that is, feature extraction is performed on the noisy text in the training set to obtain a feature matrix corresponding to the noisy text. A Token is a minimum semantic unit, and a Chinese character is generally regarded as one Token.
Illustratively, the Bidirectional Encoder Representations from Transformers (BERT) model uses pre-training and fine-tuning to accomplish natural language processing (NLP) tasks, including question answering, sentiment analysis, and language inference, among others. The ALBERT model (A Lite BERT) is a lightweight version of the BERT model. It trains with fewer parameters than BERT; its core idea is to apply two parameter-reduction methods, so that it occupies less memory than BERT and greatly improves training speed and effect.
The present application may use the ALBERT model as an encoder to extract features of the text. For example, after the ALBERT model is used as an encoder to extract features of the noisy text, a feature matrix $[e_1, e_2, e_3, \ldots, e_{15}]$ composed of 15 feature vectors may be obtained. Each feature vector in the feature matrix corresponds to one character. For example, $e_1$ corresponds to "natural", $e_2$ corresponds to "back", and so on.
The feature matrix corresponding to the noisy text is input into a feedforward neural network (FFN). The feedforward neural network acts as a decoder: it decodes the feature matrix extracted by the encoder and predicts an output label for each feature vector. The output label takes values in {O, D}, where O means keeping the current character and D means deleting the current character. For example, the labels corresponding to "ok" and "back" in the above noisy text are "O", and the label corresponding to the word "O" is "D".
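How the {O, D} labels turn into smoothed text can be sketched as follows. The sketch assumes the decoder yields, for each token, a probability that its label is "D"; the 0.5 threshold and the function names are illustrative assumptions, not taken from the source.

```python
from typing import List, Sequence


def labels_from_probs(delete_probs: Sequence[float], threshold: float = 0.5) -> List[str]:
    """Map each per-token probability of 'D' to a label in {O, D}."""
    # Assumption: a token is deleted when its 'D' probability exceeds the threshold.
    return ["D" if p > threshold else "O" for p in delete_probs]


def apply_labels(tokens: Sequence[str], labels: Sequence[str]) -> List[str]:
    """Keep tokens labeled 'O' and drop tokens labeled 'D'."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return [tok for tok, lab in zip(tokens, labels) if lab == "O"]


tokens = ["I", "feel", "feel", "good"]
labels = labels_from_probs([0.1, 0.9, 0.2, 0.05])
print(apply_labels(tokens, labels))
```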
The text smoothing initial model is iteratively trained according to the weighted cross entropy loss value between the output label and the real label, to obtain the text smoothing model of the intermediate version. Specifically, the fitting objective in the training process is defined by a weighted cross entropy loss function (Weighted-Cross Entropy Loss) between the output label and the real label, which can be represented by formula (1) as follows: $L = -\left(W_1 \cdot y \log(p) + W_2 \cdot (1 - y) \log(1 - p)\right)$ (1)
Wherein $L$ represents the weighted cross entropy loss value between the output label and the real label; $p$ represents the probability that the output label corresponding to the feature vector is "D", i.e., the probability that the corresponding character is a useless character; $y$ represents the real label corresponding to the feature vector: $y = 1$ if the real label is "D", and $y = 0$ if the real label is "O"; $W_1$ and $W_2$ respectively represent the weight for characters whose label is "D" and the weight for characters whose label is "O". Illustratively, $W_1$ is preset to 1.5 and $W_2$ is preset to 0.2. The weighting makes the text smoothing model pay more attention to characters labeled "D", which alleviates the poor smoothing effect caused by sparse training data, so that the text smoothing effect is not degraded even though the useless characters to be deleted are far fewer than the normal characters to be kept.
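Formula (1) can be checked numerically with a few lines of Python. This is a minimal sketch for a single character; the clamping constant `eps` is an added assumption to avoid `log(0)` and is not part of the source formula.

```python
import math


def weighted_cross_entropy(p: float, y: int, w1: float = 1.5, w2: float = 0.2) -> float:
    """Formula (1): L = -(W1 * y * log(p) + W2 * (1 - y) * log(1 - p)).

    p is the predicted probability of label "D"; y is 1 for a real "D", 0 for "O".
    """
    eps = 1e-12  # assumption: clamp p away from 0 and 1 for numerical safety
    p = min(max(p, eps), 1.0 - eps)
    return -(w1 * y * math.log(p) + w2 * (1 - y) * math.log(1.0 - p))


# With the same uncertainty (p = 0.5), a missed "D" costs 7.5 times more than a
# wrongly flagged "O", because W1 / W2 = 1.5 / 0.2.
print(weighted_cross_entropy(0.5, 1), weighted_cross_entropy(0.5, 0))
```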
After each round of data training is completed, a new round of noise adding processing is performed on the sample data based on the preset noise adding rule. Each round of training therefore runs on different noisy texts that are generated from the same source data and are identically distributed, which can effectively improve the generalization of the text smoothing model.
After a preset number of training rounds, illustratively about 20 epochs, the model tends to converge, and the text smoothing model of the intermediate version is obtained. Here, 1 epoch is one pass of training over all the noisy text in the training set.
The test set is input into the text smoothing model of the intermediate version to obtain predicted labels, and deviation analysis is performed on the predicted labels and the real labels to obtain a first deviation analysis result. According to the first deviation analysis result, the noise adding rule is modified in a targeted manner to obtain an iteratively updated noise adding rule. Noise is added to the sample data based on the iteratively updated noise adding rule to obtain an updated training set and an updated test set, and the text smoothing model of the intermediate version is iteratively trained in this way to obtain the text smoothing model.
In one possible implementation, the corresponding proportion may be modified for the type of noise-adding rule during the next round of training. For example, in the process of performing deviation analysis on the predicted tag and the real tag, it is found that when a random word is inserted at a position of a random index of an original text, a sentence head is inserted into the word, and a repeated word is inserted at a position of the random index, the predicted tag is consistent with the real tag, but when a repeated word is inserted at a position of the random index of the original text, the predicted tag corresponding to the repeated word is deviated from the real tag corresponding to the repeated word, and the current text smoothing model is considered to have poor text smoothing effect when the repeated word is inserted at the position of the random index. Thus, the noise addition rule can be updated for this type of noise addition rule.
For example, when noise is added to the sample data in the next round, the proportions of inserting a random word at a random index position, inserting a word at the sentence head, and inserting a repeated word at a random index position are adjusted to 20%, and the proportion of inserting a repeated word at a random index position of the original text is adjusted to 40%. In this way, the current text smoothing model receives more training on repeated words inserted at random index positions of the original text, can identify repeated words better, and the accuracy of the predicted label is improved.
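One way to realize such a proportion adjustment is to boost the weight of the rule type on which the model performs poorly and then renormalize. The sketch below is an assumption about how this could be done, not the method stated in the source; the rule names and the `boost` factor are hypothetical. Starting from four equally weighted rule types, doubling the weak one happens to give the 20% / 40% split used in the example.

```python
from typing import Dict


def reweight(proportions: Dict[str, float], weak_rule: str, boost: float = 2.0) -> Dict[str, float]:
    """Multiply the weak rule's proportion by `boost`, then renormalize to sum to 1."""
    if weak_rule not in proportions:
        raise KeyError(weak_rule)
    scaled = {name: w * (boost if name == weak_rule else 1.0) for name, w in proportions.items()}
    total = sum(scaled.values())
    return {name: w / total for name, w in scaled.items()}


# Hypothetical rule names; equal proportions before the deviation analysis.
current = {
    "random_word": 0.25,
    "sentence_head_word": 0.25,
    "repeated_word": 0.25,
    "repeated_word_original_text": 0.25,
}
updated = reweight(current, "repeated_word_original_text")
print(updated)
```

The updated proportions can then be passed as sampling weights, for example to `random.choices`, when the next round of noisy text is generated.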
In one possible implementation, a new noise adding rule may be added during the next round of training to obtain the iteratively updated noise adding rule. For example, if deviation analysis of the predicted label and the real label shows that the predicted label deviates from the real label when a specific word is used as the sentence head, then adding the specific word at the sentence head is used as a newly added noise adding rule in the next round of training. This alleviates the problem of insufficient training data and improves the generalization of the text smoothing model.
Illustratively, in addition to the above 5 types of preset noise adding rules, the iteratively updated noise adding rules may further include a specific word as a sentence head, a negative example word as a sentence head, an AAB form word segmentation (e.g., best, etc.) insertion, an AABAA form insertion specific word, a negative example word random insertion word segmentation position, a natural language word random repeated insertion, a specific natural language word format insertion (e.g., natural language word+punctuation, punctuation+natural language word), and the like. It should be noted that the above-mentioned noise adding rules are examples.
Based on the method, the intermediate version text smooth model is subjected to iterative training, during training, sample data is subjected to noise addition based on the noise addition rule updated iteratively, and the noise addition rule updated iteratively is subjected to iterative updating modification based on the deviation analysis result of each round of training. The problem of insufficient training data is solved through a self-supervision mode, the 'expert knowledge' fusion is realized through active learning, and the data distribution of insufficient text smooth model fitting is reinforced, so that the performance of the text smooth model is effectively improved, and the text smooth effect can be improved.
In one possible implementation, in order to make the text smoothing model suitable for various scenes, the intermediate version text smoothing model is iteratively trained based on the updated training set to obtain a trained text smoothing model. Once the trained text smoothing model tends to converge, domain migration fine-tuning is performed by fusing scene input. FIG. 6 is a flowchart of a text smoothing model training process provided by an embodiment of the present application.
Specifically, based on an ASR technology, performing voice recognition on the scene data to obtain a scene text corresponding to the scene data. And labeling the scene text to obtain a scene training set and a scene testing set in the vertical field of the scene.
And combining the scene training set with the sample data to obtain the sample data of the fusion scene. And combining the scene test set and the test set to obtain a test set of the fusion scene, inputting the test set of the fusion scene into the trained text smooth model, and carrying out label prediction to obtain a test label. And performing deviation analysis on the test tag and the real tag to obtain a second deviation analysis result, and expanding the iteratively updated noise rule based on the second deviation analysis result to obtain an expanded noise rule.
And based on the extended noise adding rule, noise adding is carried out on the sample data of the fusion scene, and a training set of the fusion scene is generated. The extended noise adding rule is iteratively updated based on a second deviation analysis result obtained by performing deviation analysis on the test tag and the real tag in each round of training, so that on one hand, the field migration fine adjustment can be realized, on the other hand, the model can be reinforced, the model can be trained in a targeted manner, the performance of the model is improved, and further, the text smoothing effect is improved.
And repeating the training process based on the training set of the fusion scene, performing iterative training on the trained text smooth model, and considering model fitting when no new deviation analysis result is generated, so as to obtain a final text smooth model.
In the embodiment of the application, a text to be processed is input into a final text smoothing model to obtain a first smoothing result.
S307, the electronic device performs confusion degree calculation on a third smoothness result obtained by combining the first smoothness result with a second smoothness result obtained based on the rule engine, and text confusion degree corresponding to the third smoothness result is obtained.
And S308, the electronic equipment calculates the confusion degree of the text to be processed, and the original text confusion degree is obtained.
The text confusion degree, i.e., perplexity (PPL), is an index for evaluating the performance of a language model; computing it essentially amounts to computing the probability of a sentence. For example, a sentence $S$ composed of characters $W_1$ to $W_k$ ($k$ is a positive integer) can be expressed by formula (2): $S = W_1, W_2, \ldots, W_k$ (2)
The probability of sentence $S$ can be expressed by formula (3): $P(S) = P(W_1, W_2, \ldots, W_k) = P(W_1)\,P(W_2 \mid W_1) \cdots P(W_k \mid W_1, W_2, \ldots, W_{k-1})$ (3)
The magnitude of the confusion is related to the probability of the sentence, and the larger the probability of the sentence is, the smaller the confusion is, and the smoother the sentence is.
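The relationship between sentence probability and confusion degree can be illustrated with the commonly used length-normalized definition PPL(S) = P(S)^(-1/k). The source does not state the exact formula, so this definition is an assumption; the conditional probabilities would come from a language model and are passed in here as plain numbers.

```python
import math
from typing import Sequence


def sentence_log_prob(cond_probs: Sequence[float]) -> float:
    """log P(S) by the chain rule of formula (3): sum of log P(W_i | W_1..W_{i-1})."""
    return sum(math.log(p) for p in cond_probs)


def perplexity(cond_probs: Sequence[float]) -> float:
    """Assumed definition: PPL(S) = P(S) ** (-1 / k) = exp(-log P(S) / k)."""
    k = len(cond_probs)
    if k == 0:
        raise ValueError("sentence must contain at least one character")
    return math.exp(-sentence_log_prob(cond_probs) / k)


fluent = [0.5, 0.4, 0.6]      # hypothetical per-character conditional probabilities
disfluent = [0.5, 0.05, 0.6]  # one very unlikely character
print(perplexity(fluent), perplexity(disfluent))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied for long sentences.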
S309, judging whether the text confusion degree corresponding to the third smooth result is larger than the original text confusion degree.
For example, suppose the text to be processed is: A1, A2, A3, A4, A5, A6, A7, A8, A9, A10. The three texts A1, A2, A4, A5, A6, A7, A8, A9, A10; A1, A2, A3, A4, A5, A6, A7, A9, A10; and A1, A2, A4, A5, A6, A7, A9, A10 correspond to the confusion degrees PPL1, PPL2 and PPL3, respectively. These are compared with the original text confusion degree PPL0 corresponding to the text to be processed, and the text corresponding to the lowest value among PPL0 to PPL3 is selected as the final result.
If the text confusion degree corresponding to the third smooth result is greater than the original text confusion degree, step S310 is executed: the deletion is reverted and the text to be processed is output. If the text confusion degree corresponding to the third smooth result is less than or equal to the original text confusion degree, step S311 is executed: the text after deletion, i.e., the text corresponding to the third smooth result, is output and taken as the final smooth text.
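The decision in S309 to S311, and the lowest-value selection among several candidates, can be sketched as below. The scoring function `ppl` is a hypothetical callable standing in for the language model, and the scores in the example are made up for illustration.

```python
from typing import Callable, Sequence


def select_output(original: str, smoothed: str, ppl: Callable[[str], float]) -> str:
    """Revert the deletion if it raised the confusion degree; otherwise keep it."""
    if ppl(smoothed) > ppl(original):
        return original  # S310: output the text to be processed
    return smoothed      # S311: output the text after deletion


def select_lowest(candidates: Sequence[str], ppl: Callable[[str], float]) -> str:
    """Pick the candidate with the lowest confusion degree."""
    return min(candidates, key=ppl)


# Hypothetical scores standing in for a language model.
scores = {"a b c": 12.0, "a c": 9.0, "a b": 15.0}
ppl = scores.__getitem__
print(select_output("a b c", "a c", ppl))
print(select_lowest(["a b c", "a c", "a b"], ppl))
```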
In one possible implementation, the confusion degree calculation may be performed on the first smooth result to obtain the text confusion degree corresponding to the first smooth result. If the text confusion degree corresponding to the first smooth result is greater than the original text confusion degree, the deletion is reverted and the text to be processed is output; if it is less than or equal to the original text confusion degree, the text after deletion, i.e., the text corresponding to the first smooth result, is output and taken as the final smooth text.
In summary, in the process of obtaining the text smooth model through multiple rounds of iterative training, the application uses the idea of active learning to iteratively update the noise adding rule according to the deviation between the predicted label and the real label; based on the iterative updating noise adding rule, the sample data is subjected to noise adding, so that the rule coverage range is enlarged, the data richness is expanded, the training data is increased, the performance of the text smoothing model is improved, and the smoothing effect of the voice text is further improved. Meanwhile, training is carried out on the trained text smooth model based on sample data and scene data, so that the field migration fine adjustment is realized. On the basis, a rule engine is introduced, a text smoothing model is combined with the rule engine, and the final result is selected through the confusion degree calculation of the language model, so that the accuracy is further improved, and the text smoothing effect is improved.
Embodiment two:
Since an overly long text to be processed may affect the efficiency and effect of text smoothing, the embodiment of the present application differs from the above embodiment in that, after the text to be processed is obtained, sentence segmentation is performed on it; that is, the text to be processed is divided into several clause texts. On this basis, each clause text is smoothed, which further improves the text smoothing effect. A second embodiment of the present application will be described with reference to FIG. 7, which is a flowchart of another text smoothing method provided in the embodiment of the present application. The parts that are the same as in the above embodiment are not described here again.
S401, the electronic equipment receives the audio input by the user.
S402, the electronic equipment carries out voice recognition on the audio to obtain a text to be processed.
S403, the electronic equipment performs sentence segmentation on the text to be processed to obtain clause text. The clause text may consist of a plurality of clauses. The present application does not specifically limit the sentence segmentation mode; illustratively, sentence segmentation may be performed on the text to be processed based on one or more of a preset duration, a preset number of characters, and a pause duration.
When sentence segmentation is performed on the text to be processed based on the preset duration, integer multiples of the preset duration are the segmentation time points. If the duration from the start of the last segment to the end of the audio is less than the preset duration, the span from that start time to the end of the audio forms one clause. For example, with a preset duration of 30 seconds, the text to be processed converted from audio lasting 1 minute and 20 seconds is divided into 3 clauses using 30 seconds, 60 seconds and 90 seconds as segmentation time points: the text converted from the first 30 seconds of audio is one clause, the text converted from the 31st to 60th seconds is one clause, and the text converted from the 61st to 80th seconds is one clause, since the interval from the 61st to the 90th second contains only 20 seconds of effective audio. The preset duration can be set according to specific conditions.
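Duration-based segmentation could be sketched as follows, assuming the speech recognizer supplies a start time in seconds for each recognized token. That input format and the function name are assumptions, since the source does not specify how timestamps are attached to the text.

```python
from typing import Dict, List, Tuple


def split_by_duration(tokens: List[Tuple[str, float]], preset: float) -> List[str]:
    """Group (text, start_time) tokens into clauses of `preset` seconds each."""
    if preset <= 0:
        raise ValueError("preset duration must be positive")
    segments: Dict[int, List[str]] = {}
    for text, start in tokens:
        # Integer multiples of the preset duration are the segmentation time points.
        segments.setdefault(int(start // preset), []).append(text)
    # Windows that contain no speech produce no clause.
    return ["".join(segments[k]) for k in sorted(segments)]


tokens = [("a", 0.0), ("b", 29.9), ("c", 30.0), ("d", 59.0), ("e", 61.0), ("f", 79.5)]
print(split_by_duration(tokens, 30.0))
```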
When sentence segmentation is performed on the text to be processed based on the preset number of characters, the first character in the text to be processed is taken as the starting point, and integer multiples of the preset number of characters are taken as the segmentation points. If the number of characters from the last segmentation point to the end of the text to be processed is smaller than the preset number of characters, the span from that segmentation point to the end of the text forms one clause. For example, if the text to be processed contains 80 characters and the preset number of characters is 30, the text can be divided into 3 clauses using 30, 60 and 90 as segmentation points: the first 30 characters form one clause, the 31st to 60th characters form one clause, and the 61st to 80th characters form one clause. The preset number of characters can be set according to specific conditions.
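The character-count mode maps directly onto fixed-width slicing. The following is a minimal sketch; the function name is hypothetical.

```python
from typing import List


def split_by_chars(text: str, preset: int) -> List[str]:
    """Cut the text at integer multiples of `preset`; the tail forms the last clause."""
    if preset <= 0:
        raise ValueError("preset number of characters must be positive")
    return [text[i:i + preset] for i in range(0, len(text), preset)]


clauses = split_by_chars("x" * 80, 30)
print([len(c) for c in clauses])
```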
When sentence segmentation is performed based on the pause duration, the text is segmented wherever the pause duration exceeds a threshold. The threshold can be set according to specific conditions.
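Pause-based segmentation can be sketched in the same spirit, assuming each recognized token carries a start and an end time in seconds, so that the pause is the gap between one token's end and the next token's start. The input format and names are illustrative assumptions.

```python
from typing import List, Optional, Tuple


def split_by_pause(tokens: List[Tuple[str, float, float]], threshold: float) -> List[str]:
    """Start a new clause whenever the silence before a token exceeds `threshold`."""
    clauses: List[str] = []
    current: List[str] = []
    prev_end: Optional[float] = None
    for text, start, end in tokens:
        if prev_end is not None and start - prev_end > threshold and current:
            clauses.append("".join(current))
            current = []
        current.append(text)
        prev_end = end
    if current:
        clauses.append("".join(current))
    return clauses


tokens = [("a", 0.0, 0.5), ("b", 0.6, 1.0), ("c", 2.5, 3.0)]
print(split_by_pause(tokens, 1.0))
```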
In the embodiment of the application, the clause text can be obtained based on one mode, and the three modes can be combined to obtain the clause text.
S404, after the sentence text is obtained, word segmentation processing is carried out on each sentence in the sentence text, and the word segmentation text is obtained.
S405, the electronic equipment performs smoothing processing on the segmentation text based on a rule for deleting useless characters in the text in a rule engine, and a second smoothing result corresponding to the segmentation text can be obtained.
S406, in parallel with the word segmentation processing that obtains the word segmentation text, i.e., in parallel with S404, the electronic equipment inputs the clause text into the text smoothing model to obtain a first smooth result.
S407, the electronic equipment combines the second smooth result corresponding to the clause text with the first smooth result to obtain a third smooth result corresponding to the clause text.
S408, confusion degree calculation is performed on the third smooth result corresponding to the clause text to obtain the text confusion degree corresponding to the third smooth result.
S409, the confusion degree of the text to be processed is calculated to obtain the original text confusion degree.
S410, judging whether the text confusion degree corresponding to the third smooth result corresponding to the clause text is larger than the original text confusion degree.
If the text confusion degree corresponding to the third smooth result corresponding to the clause text is greater than the original text confusion degree, S411 is executed: the deletion is reverted and the text to be processed is output. If it is less than or equal to the original text confusion degree, S412 is executed: the text after deletion, i.e., the text corresponding to the third smooth result corresponding to the clause text, is output and taken as the final smooth text.
In one possible implementation, the confusion degree calculation may be performed on the first smooth result corresponding to the clause text to obtain the corresponding text confusion degree. If the text confusion degree corresponding to the first smooth result is greater than the original text confusion degree, the deletion is reverted and the text to be processed is output; if it is less than or equal to the original text confusion degree, the text after deletion, i.e., the text corresponding to the first smooth result, is output and taken as the final smooth text.
In summary, during training of the text smoothing model in the embodiment of the present application, noise is added to the sample data based on the iteratively updated noise adding rule, which alleviates the problem of insufficient training data, improves the generalization of the model, and brings the distributions of the training set and the scene training set closer together. In the process of obtaining the text smoothing model through multiple rounds of iterative training, the noise adding rule is iteratively updated according to the deviation between the predicted label and the real label. The self-supervised approach is combined with the idea of active learning: self-supervision addresses the problem of insufficient training data, while active learning improves model performance and reinforces the data distributions that the model fits insufficiently. Meanwhile, on the basis of the first embodiment, in order to avoid the loss of efficiency and effect caused by an overly long text to be processed, sentence segmentation is performed on the text to be processed before smoothing, which further improves the efficiency and accuracy of text smoothing.
In some embodiments, the structure of the electronic device may be shown in fig. 8, and fig. 8 is a schematic structural diagram of the electronic device according to the embodiment of the present application.
As shown in fig. 8, the electronic device may include a processor 110, an external memory interface 120, an internal memory 121, an audio module 130, a microphone 130A, a sensor module 140, a display 150, and the like. The sensor module 140 may include a pressure sensor 140A, a fingerprint sensor 140B, a touch sensor 140C, and the like.
It is to be understood that the configuration illustrated in this embodiment does not constitute a specific limitation on the electronic apparatus. In other embodiments, the electronic device may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. The different processing units may be separate devices or may be integrated in one or more processors. For example, in the present application, the processor 110 performs voice recognition on the audio to obtain the text to be processed; inputs the text to be processed into the text smoothing model for label prediction to obtain a first smooth result, wherein the training set of the text smoothing model is obtained by adding noise to sample data through an iteratively updated noise adding rule, the iteratively updated noise adding rule is obtained by performing deviation analysis on a predicted label and a real label, and the predicted label is obtained by performing label prediction on a test set through the text smoothing model in the training process; and, according to the original text confusion degree corresponding to the text to be processed and the text confusion degree corresponding to the first smooth result, takes the text with the lower confusion degree as the smooth text.
The controller may be the nerve center and command center of the electronic device. The controller can generate operation control signals according to instruction operation codes and timing signals to control instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to reuse the instructions or data, they can be called directly from the memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving the efficiency of the system.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, and the like.
The I2C interface is a bi-directional synchronous serial bus comprising a serial data line (SDA) and a serial clock line (SCL). In some embodiments, the processor 110 may be coupled to the touch sensor 140C through an I2C interface, so that the processor 110 communicates with the touch sensor 140C through the I2C bus interface to implement the touch function of the electronic device. For example, in an embodiment of the present application, based on the touch function, a user may long-press a microphone key in a user interface of a note application while inputting voice.
The I2S interface may be used for audio communication. In some embodiments, the processor 110 may contain multiple sets of I2S buses. The processor 110 may be coupled to the audio module 130 via an I2S bus to enable communication between the processor 110 and the audio module 130. In some embodiments, the audio module 130 may communicate audio signals to the wireless communication module 160 through an I2S interface.
It should be understood that the connection relationship between the modules illustrated in this embodiment is only illustrative, and does not limit the structure of the electronic device. In other embodiments of the present application, the electronic device may also use different interfacing manners, or a combination of multiple interfacing manners in the foregoing embodiments.
The electronic device implements display functions through the GPU, the display screen 150, and the application processor, etc. The GPU is a microprocessor for image processing, and is connected to the display 150 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 150 is used to display images, videos, and the like. The display screen 150 includes a display panel. The display panel may employ a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the electronic device may include 1 or N display screens 150, N being a positive integer greater than 1.
A series of graphical user interfaces (GUIs) may be displayed on the display screen 150 of the electronic device, all of which are home screens of the electronic device. Generally, the size of the display screen 150 of an electronic device is fixed, so only a limited number of controls can be displayed on it. A control is a GUI element: a software component contained in an application that governs the data processed by the application and the interactive operations on that data. A user can interact with a control through direct manipulation to read or edit information about the application. In general, controls may include visual interface elements such as icons, buttons, menus, tabs, text boxes, dialog boxes, status bars, navigation bars, and widgets. For example, in an embodiment of the present application, smooth text may be displayed on the display screen 150.
The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example the way signals are transmitted between neurons in the human brain, it can process input information rapidly and can also learn continuously. The NPU enables applications such as intelligent cognition on the electronic device, for example image recognition, face recognition, speech recognition, and text understanding. In an exemplary embodiment of the present application, voice recognition may be performed on the audio based on the NPU to obtain the text to be processed.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the electronic device. The external memory card communicates with the processor 110 through an external memory interface 120 to implement data storage functions. For example, files such as music, video, etc. are stored in an external memory card.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. The internal memory 121 may include a storage program area and a storage data area. The storage program area may store an operating system, an application required for at least one function (such as a sound playing function or an image playing function), and the like. The storage data area may store data created during use of the electronic device (such as audio data and a phonebook), and the like. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS). The processor 110 performs the various functional applications and data processing of the electronic device by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
The electronic device may implement audio functions, such as music playback and recording, through the audio module 130, the microphone 130A, and the like.
The audio module 130 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 130 may also be used to encode and decode audio signals. In some embodiments, the audio module 130 may be disposed in the processor 110, or a portion of the functional modules of the audio module 130 may be disposed in the processor 110.
The microphone 130A, also referred to as a "mic" or "mike", is used to convert sound signals into electrical signals. When making a call or sending voice information, the user can speak with the mouth close to the microphone 130A to input a sound signal to the microphone 130A. The electronic device may be provided with at least one microphone 130A. In other embodiments, the electronic device may be provided with two microphones 130A, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device may also be provided with three, four, or more microphones 130A to implement sound signal collection, noise reduction, sound source identification, directional recording, and the like. Illustratively, in an embodiment of the present application, audio may be collected by the microphone 130A to obtain the text to be processed.
The pressure sensor 140A is used to sense a pressure signal and may convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 140A may be disposed on the display screen 150. There are various types of pressure sensor 140A, such as resistive pressure sensors, inductive pressure sensors, and capacitive pressure sensors. A capacitive pressure sensor may comprise at least two parallel plates made of conductive material. When a force is applied to the pressure sensor 140A, the capacitance between the electrodes changes, and the electronic device determines the strength of the pressure from the change in capacitance. When a touch operation is applied to the display screen 150, the electronic device detects the intensity of the touch operation by means of the pressure sensor 140A. The electronic device may also calculate the location of the touch based on the detection signal of the pressure sensor 140A. In some embodiments, touch operations that act on the same touch location but with different touch operation intensities may correspond to different operation instructions.
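As a minimal sketch of how one touch location can map to different operation instructions depending on the detected intensity, consider the following Python snippet. The threshold value, the `TouchEvent` type, and the instruction names are hypothetical and are not taken from the present application.

```python
from dataclasses import dataclass

# Hypothetical threshold in normalized pressure units (not from the application).
FIRST_PRESSURE_THRESHOLD = 0.5


@dataclass
class TouchEvent:
    x: int
    y: int
    pressure: float  # intensity derived from the capacitance change


def instruction_for(event: TouchEvent,
                    threshold: float = FIRST_PRESSURE_THRESHOLD) -> str:
    """Pick an operation instruction from the touch intensity at one location."""
    if event.pressure < threshold:
        return "view_note"         # light press: hypothetical instruction name
    return "start_voice_input"     # firm press: hypothetical instruction name
```

A single threshold is used here only to keep the illustration short; a real device could distinguish more than two intensity levels.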
The fingerprint sensor 140B is used to collect fingerprints. The electronic device can use the collected fingerprint features to implement fingerprint unlocking, application lock access, fingerprint recording, and the like.
The touch sensor 140C is also referred to as a "touch device". The touch sensor 140C may be disposed on the display screen 150, and the touch sensor 140C and the display screen 150 together form a touchscreen, also referred to as a "touch-controlled screen". The touch sensor 140C is used to detect a touch operation acting on or near it. The touch sensor may pass the detected touch operation to the application processor to determine the type of touch event. Visual output related to the touch operation may be provided through the display screen 150. In other embodiments, the touch sensor 140C may also be disposed on the surface of the electronic device at a location different from that of the display screen 150.
In addition, an operating system runs on the above components, for example the iOS operating system developed by Apple, the Android open-source operating system developed by Google, or the Windows operating system developed by Microsoft. Applications may be installed and run on the operating system.
The operating system of the electronic device may employ a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In the embodiments of the present application, an Android system with a layered architecture is taken as an example to illustrate the software structure of the electronic device.
Fig. 9 is a software configuration block diagram of an electronic device according to an embodiment of the present application.
The layered architecture divides the software into several layers, each with a distinct role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, an application layer, an application framework layer, the Android runtime (Android Runtime) and system libraries, and a kernel layer.
The application layer may include a series of application packages. As shown in fig. 9, the application packages may include applications such as Notes, WLAN, and Bluetooth.
The application framework layer provides an application programming interface (application programming interface, API) and programming framework for application programs of the application layer. The application framework layer includes a number of predefined functions. As shown in FIG. 9, the application framework layer may include a window manager, a content provider, a view system, a text smoothing algorithm, and the like.
The window manager is used to manage window programs. The window manager can obtain the size of the display screen, determine whether a status bar exists, lock the screen, take screenshots, and the like.
The content provider is used to store and retrieve data and make such data accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebooks, etc.
The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, a display interface including a text message notification icon may include a view displaying text and a view displaying a picture.
The text smoothing algorithm is used to perform text smoothing processing on the text to be processed that was converted from audio, so as to obtain smooth text.
The Android runtime includes a core library and virtual machines. The Android runtime is responsible for scheduling and management of the Android system.
The core library consists of two parts: one part comprises the functions that the Java language needs to call, and the other part is the core library of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
The system libraries may include a plurality of functional modules, for example media libraries (Media Libraries) and three-dimensional graphics processing libraries (e.g., OpenGL ES).
The media libraries support playback and recording of a variety of commonly used audio and video formats, as well as still image files. The media libraries may support a variety of audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG.
The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.
The kernel layer is a layer between hardware and software. The kernel layer includes at least a display driver, a camera driver, an audio driver, and a sensor driver.
Although the Android system is taken as an example for explanation, the basic principle of the embodiments of the present application is also applicable to electronic devices based on iOS, Windows, and other operating systems.
The present embodiment also provides a computer-readable storage medium including instructions that, when executed on an electronic device, cause the electronic device to perform the related method steps described above to implement the method of the above embodiment.
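For orientation, the inference-time steps of the method (rule-based deletion of useless words, merging with the labels predicted by the model, and selection by confusion degree) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the present application: the "confusion degree" is assumed to be language-model perplexity, the tiny bigram model stands in for whatever language model an implementation would use, and the `FILLERS` and `SPECIAL` word lists and all function names are hypothetical.

```python
import math
from collections import Counter

FILLERS = {"um", "uh"}   # hypothetical intonation (filler) words
SPECIAL = {"bye"}        # hypothetical words whose repetition is legitimate


def rule_labels(tokens):
    """Rule-engine pass: True marks a token to delete (filler or immediate repeat)."""
    return [tok in FILLERS or (i > 0 and tok == tokens[i - 1])
            for i, tok in enumerate(tokens)]


def merge_labels(model_labels, rule_labels_, tokens):
    """Union of both results, then recover special words (never delete them)."""
    merged = [m or r for m, r in zip(model_labels, rule_labels_)]
    return [False if tok in SPECIAL else d for tok, d in zip(tokens, merged)]


class BigramLM:
    """Tiny add-one-smoothed bigram model standing in for a real language model."""

    def __init__(self, corpus):
        self.uni, self.bi = Counter(), Counter()
        for sentence in corpus:
            toks = ["<s>"] + sentence
            self.uni.update(toks)
            self.bi.update(zip(toks, toks[1:]))
        self.vocab = len(self.uni) + 1  # one extra slot for unseen words

    def perplexity(self, tokens):
        toks = ["<s>"] + tokens
        log_p = 0.0
        for prev, cur in zip(toks, toks[1:]):
            p = (self.bi[(prev, cur)] + 1) / (self.uni[prev] + self.vocab)
            log_p += math.log(p)
        return math.exp(-log_p / max(len(tokens), 1))


def choose_smooth_text(tokens, delete_labels, lm):
    """Keep whichever of the original and the smoothed text has lower perplexity."""
    candidate = [t for t, d in zip(tokens, delete_labels) if not d]
    if lm.perplexity(candidate) < lm.perplexity(tokens):
        return candidate
    return tokens
```

In use, the model's predicted labels for a sentence would be passed to `merge_labels` together with the output of `rule_labels`, and the merged labels would then be passed to `choose_smooth_text`. Falling back to the original text when the smoothed candidate scores worse guards against over-deletion.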
The foregoing is merely a description of specific embodiments of the present application, and the protection scope of the present application is not limited thereto. Any change or substitution within the technical scope of the present application shall fall within its protection scope. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A text smoothing method, comprising:
performing voice recognition on audio to obtain a text to be processed;
performing label prediction on the text to be processed by a text smoothing model to obtain a first smoothing result, wherein a training set of the text smoothing model is obtained by adding noise to sample data through an iteratively updated noise adding rule, the iteratively updated noise adding rule is obtained by performing deviation analysis on a predicted label and a real label, and the predicted label is obtained by performing label prediction on a test set through the text smoothing model in a training process;
and taking, according to an original text confusion degree corresponding to the text to be processed and a text confusion degree corresponding to the first smoothing result, the text corresponding to the lower confusion degree as the smooth text.
2. The method according to claim 1, wherein, in addition to the performing label prediction on the text to be processed by the text smoothing model to obtain the first smoothing result, the method further comprises:
performing word segmentation on the text to be processed to obtain word segmentation text;
performing smoothing processing on the word segmentation text according to rules in a rule engine for deleting useless characters in text, to obtain a second smoothing result;
combining the first smoothing result with the second smoothing result to obtain a third smoothing result;
wherein the taking, according to the original text confusion degree corresponding to the text to be processed and the text confusion degree corresponding to the first smoothing result, the text corresponding to the lower confusion degree as the smooth text comprises:
taking, according to the original text confusion degree corresponding to the text to be processed and a text confusion degree corresponding to the third smoothing result, the text corresponding to the lower confusion degree as the smooth text.
3. The method according to claim 1 or 2, wherein the performing label prediction on the text to be processed by the text smoothing model to obtain the first smoothing result comprises:
performing clause segmentation on the text to be processed to obtain clause text;
and inputting the clause text into the text smoothing model for label prediction to obtain the first smoothing result.
4. The method according to claim 3, wherein, in connection with the taking, according to the original text confusion degree corresponding to the text to be processed and the text confusion degree corresponding to the first smoothing result, the text corresponding to the lower confusion degree as the smooth text, the method further comprises:
performing word segmentation on the clause text to obtain word segmentation text corresponding to the clause text;
performing smoothing processing on the word segmentation text corresponding to the clause text according to rules in a rule engine for deleting useless characters in text, to obtain a second smoothing result corresponding to the clause text;
combining the second smoothing result corresponding to the clause text with the first smoothing result to obtain a third smoothing result corresponding to the clause text;
wherein the taking, according to the original text confusion degree corresponding to the text to be processed and the text confusion degree corresponding to the first smoothing result, the text corresponding to the lower confusion degree as the smooth text comprises:
taking, according to the original text confusion degree corresponding to the text to be processed and a text confusion degree corresponding to the third smoothing result corresponding to the clause text, the text corresponding to the lower confusion degree as the smooth text.
5. The method of claim 2, wherein the rules in the rule engine for deleting useless characters in text include deleting repeated words in the word segmentation text, deleting intonation words in the word segmentation text, and recovering special words, and wherein the performing smoothing processing on the word segmentation text according to the rules in the rule engine for deleting useless characters in text to obtain the second smoothing result comprises:
deleting the repeated words and/or the intonation words in the word segmentation text according to the rule for deleting repeated words in the word segmentation text and the rule for deleting intonation words in the word segmentation text, to obtain the second smoothing result;
the combining the first smoothing result with the second smoothing result to obtain the third smoothing result comprises:
combining the first smoothing result with the second smoothing result to obtain a sub-smoothing result;
and recovering the special words in the sub-smoothing result according to the rule for recovering special words, to obtain the third smoothing result.
6. The method of claim 1, wherein the training of the text smoothing model comprises:
adding noise to the sample data to obtain a training set and a test set;
extracting features of noisy text in the training set to obtain a feature matrix corresponding to the noisy text;
performing decoding prediction on the feature matrix to obtain an output label;
iteratively training an initial text smoothing model according to a weighted cross-entropy loss value between the output label and the real label to obtain an intermediate version of the text smoothing model;
inputting the test set into the intermediate version of the text smoothing model to obtain the predicted label;
iteratively updating a preset noise adding rule based on a first deviation analysis result obtained by performing deviation analysis on the predicted label and the real label, to obtain the iteratively updated noise adding rule;
adding noise to the sample data according to the iteratively updated noise adding rule to obtain an updated training set;
and iteratively training the intermediate version of the text smoothing model based on the updated training set to obtain the text smoothing model.
7. The method of claim 6, wherein the iteratively training the intermediate version of the text smoothing model based on the updated training set to obtain the text smoothing model comprises:
iteratively training the intermediate version of the text smoothing model based on the updated training set to obtain a trained text smoothing model;
combining the test set with a scene test set to obtain a test set of a fusion scene, and combining the sample data with a scene training set to obtain sample data of the fusion scene;
inputting the test set of the fusion scene into the trained text smoothing model to obtain a test label;
expanding the iteratively updated noise adding rule based on a second deviation analysis result obtained by performing deviation analysis on the test label and the real label, to obtain an expanded noise adding rule;
adding noise to the sample data of the fusion scene according to the expanded noise adding rule to obtain a training set of the fusion scene;
and training the trained text smoothing model based on the training set of the fusion scene to obtain the text smoothing model.
8. The method of claim 7, wherein, for the combining the test set with the scene test set to obtain the test set of the fusion scene and the combining the sample data with the scene training set to obtain the sample data of the fusion scene, the method further comprises:
performing voice recognition on scene data to obtain scene text corresponding to the scene data;
and labeling the scene text to obtain the scene test set and the scene training set.
9. An electronic device, comprising a processor and a memory, wherein:
the memory is used for storing program code and transmitting the program code to the processor;
the processor is configured to perform the steps of the text smoothing method as claimed in any of claims 1-8 according to instructions in the program code.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of a text smoothing method according to any of claims 1-8.
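The training steps recited in claim 6 (adding noise to sample data, a weighted cross-entropy loss, deviation analysis, and updating of the noise adding rule) can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the claims do not specify the form of the noise adding rule, so it is modeled here as a hypothetical per-category injection probability, and the filler list, the loss weight, the update step, and all function names are likewise hypothetical.

```python
import math
import random
from collections import Counter

FILLERS = ["um", "uh"]  # hypothetical filler vocabulary


def add_noise(tokens, rule, rng):
    """Inject disfluencies; label 1 marks an injected token the model should delete."""
    noisy, labels = [], []
    for tok in tokens:
        if rng.random() < rule["filler"]:
            noisy.append(rng.choice(FILLERS))
            labels.append(1)
        if rng.random() < rule["repeat"]:
            noisy.append(tok)  # duplicated copy placed before the original token
            labels.append(1)
        noisy.append(tok)
        labels.append(0)
    return noisy, labels


def weighted_cross_entropy(probs, labels, w_delete=2.0):
    """Binary cross entropy with extra weight on the rarer 'delete' label.

    probs[i] is the model's probability that token i should be deleted and
    must lie strictly between 0 and 1.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total -= w_delete * math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(labels)


def deviation_analysis(predicted, real, categories):
    """Miss rate per noise category: real label is 1 but the model did not predict 1."""
    missed, total = Counter(), Counter()
    for p, r, cat in zip(predicted, real, categories):
        if r == 1:
            total[cat] += 1
            missed[cat] += int(p != 1)
    return {cat: missed[cat] / total[cat] for cat in total}


def update_rule(rule, miss_rate, step=0.05, cap=0.5):
    """Raise the injection probability of the categories the model misses most."""
    return {cat: min(cap, p + step * miss_rate.get(cat, 0.0))
            for cat, p in rule.items()}
```

One round of the loop would call `add_noise` to build the training set, train with `weighted_cross_entropy`, run `deviation_analysis` on the test-set predictions, and pass the result to `update_rule` before regenerating the training set. Raising the probability of frequently missed categories is one plausible update policy; the claims leave the policy open.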
CN202310682675.8A 2023-06-09 2023-06-09 Text smoothing method, device and storage medium Active CN116434753B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310682675.8A CN116434753B (en) 2023-06-09 2023-06-09 Text smoothing method, device and storage medium

Publications (2)

Publication Number Publication Date
CN116434753A CN116434753A (en) 2023-07-14
CN116434753B true CN116434753B (en) 2023-10-24

Family

ID=87081768

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310682675.8A Active CN116434753B (en) 2023-06-09 2023-06-09 Text smoothing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN116434753B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110853621A (en) * 2019-10-09 2020-02-28 科大讯飞股份有限公司 Voice smoothing method and device, electronic equipment and computer storage medium
CN111797895A (en) * 2020-05-30 2020-10-20 华为技术有限公司 Training method of classifier, data processing method, system and equipment
CN112163530A (en) * 2020-09-30 2021-01-01 江南大学 SSD small target detection method based on feature enhancement and sample selection
CN113140221A (en) * 2021-04-27 2021-07-20 深圳前海微众银行股份有限公司 Language model fusion method, device, medium and computer program product
CN114611492A (en) * 2022-03-17 2022-06-10 北京中科智加科技有限公司 Text smoothing method and system and computer equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11934795B2 (en) * 2021-01-29 2024-03-19 Oracle International Corporation Augmented training set or test set for improved classification model robustness

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
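Spoken-text smoothing algorithm based on a self-attention mechanism (基于自注意力机制的口语文本顺滑算法); Wu Shuangzhi; Zhang Dongdong; Zhou Ming; Intelligent Computer and Applications (智能计算机与应用), No. 06; full text *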

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant