WO2023130951A1 - Speech sentence segmentation method and apparatus, electronic device, and storage medium - Google Patents

Speech sentence segmentation method and apparatus, electronic device, and storage medium

Info

Publication number
WO2023130951A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
text
type
predicted
test
Prior art date
Application number
PCT/CN2022/140275
Other languages
French (fr)
Chinese (zh)
Inventor
李嘉辉
肖畅
翁志伟
孙仿逊
Original Assignee
广州小鹏汽车科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州小鹏汽车科技有限公司 filed Critical 广州小鹏汽车科技有限公司
Publication of WO2023130951A1 publication Critical patent/WO2023130951A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/1815Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/87Detection of discrete points within a voice signal
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • G10L2015/0631Creating reference templates; Clustering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command

Definitions

  • the present application relates to the technical field of natural language processing, and in particular to a speech sentence segmentation method, device, electronic equipment and storage medium.
  • Existing speech segmentation methods are mainly applied to intelligent dialogue systems in multiple application scenarios through natural language processing technology in the field of artificial intelligence.
  • For example, in an intelligent dialogue system for an in-vehicle scenario, a speech segmentation method can identify the multiple independent instructions contained in a user's voice command and segment the command so that each independent instruction can be executed properly.
  • However, in practice it has been found that the common natural language processing models must predict the category of every single word in the user's voice command when segmenting speech, which makes the model's task heavy.
  • the first aspect of the present application provides a method for speech segmentation, including:
  • constructing continuous sentences for the first target speech text according to the classification rules of the preset sentence types, where the multiple test sentences included in the continuous sentences are text fragments extracted from the first target speech text that respectively correspond to different preset sentence types;
  • obtaining the predicted classification result of the continuous sentences through a trained classification model, where the predicted classification result includes the predicted probability of each test sentence for each preset sentence type;
  • determining the predicted sentence type corresponding to each test sentence according to the predicted classification result;
  • the first target speech text is segmented according to the predicted sentence type corresponding to each test sentence.
  • the second aspect of the present application provides a speech sentence segmentation device, including:
  • a construction module, configured to construct continuous sentences for the first target speech text according to the classification rules of the preset sentence types, where the multiple test sentences included in the continuous sentences are text fragments extracted from the first target speech text that respectively correspond to different preset sentence types;
  • an acquisition module, configured to obtain the predicted classification result of the test sentences through the trained classification model, where the predicted classification result includes the predicted probability of each test sentence for each preset sentence type;
  • a determining module, configured to determine the predicted sentence type corresponding to each test sentence according to the predicted classification result;
  • the sentence segmentation module is configured to segment the first target speech text according to the predicted sentence type corresponding to each of the test sentences.
  • the third aspect of the present application provides an electronic device, including a memory and a processor, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the processor can implement any one of the methods disclosed in the present application.
  • the fourth aspect of the present application provides a computer-readable storage medium, which stores a computer program, wherein the computer program enables the computer to execute any one of the speech segmentation methods disclosed in the present application.
  • According to the speech sentence segmentation method provided by this application, multiple text fragments respectively corresponding to different preset sentence types are extracted from the first target speech text according to the classification rules of the preset sentence types, and these text fragments are used as multiple test sentences; the test sentences are input into the trained classification model, which outputs the predicted probability of each test sentence for each preset sentence type; and after the predicted sentence type corresponding to each test sentence is determined from these probabilities, the first target speech text is segmented.
  • The speech segmentation method in this application can divide the user's voice command into multiple speech sentences and confirm whether the sentence type of the current speech sentence is a complete independent sentence, without using the trained model to detect the category of each word in the user's voice command, which simplifies the model structure and improves the accuracy of speech segmentation.
  • Fig. 1 is a schematic flow chart of a speech sentence segmentation method disclosed in the present application.
  • Fig. 2 is a schematic flow chart of another speech sentence segmentation method disclosed in the present application.
  • Fig. 3 is a schematic flow chart of another speech sentence segmentation method disclosed in the present application.
  • Fig. 4 is a schematic structural diagram of a speech sentence segmentation device disclosed in the present application.
  • Fig. 5 is a schematic structural diagram of a model training device disclosed in the present application.
  • FIG. 6 is a schematic structural diagram of an electronic device disclosed in the present application.
  • Although terms such as first, second and third may be used in this application to describe various information, such information should not be limited to these terms; the terms are only used to distinguish information of the same type from one another.
  • For example, without departing from the scope of this application, first information may also be called second information, and similarly, second information may also be called first information.
  • Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of these features.
  • “plurality” means two or more, unless otherwise specifically defined.
  • The application discloses a speech sentence segmentation method, device, electronic equipment and storage medium, which improve the accuracy of speech sentence segmentation; these are described in detail below.
  • FIG. 1 is a schematic flow chart of a speech segmentation method disclosed in the present application.
  • This method can be applied to various smart terminals, such as smart phones, smart homes, wearable devices, vehicle-mounted terminals and other electronic devices, which are not specifically limited.
  • the usage scenarios of this method can be industries and scenarios such as smart home, vehicle voice, intelligent customer service, medical scenarios, and industrial scenarios.
  • the method includes the following steps:
  • the multiple test sentences included in the continuous sentence are text segments that are extracted from the first target speech text and respectively correspond to different preset sentence types.
  • the first target speech text is generated by performing text preprocessing after converting the audio signal in the user's speech command into corresponding text data through the process of recognition and understanding.
  • The first target speech text has the characteristics of the corresponding application scenario. For example, in a vehicle speech scenario the first target speech text may include: open the car window, play music, turn on the lights, etc.; in a smart home scenario, such as a smart speaker, it may include: search for popular songs, turn up the volume, switch to the next song, etc.
  • The preset sentence types include at least a first type, a second type, a third type and a fourth type of sentence. The first type of sentence is the text of an incomplete instruction; the second type is the text of a complete instruction; the third type is the complete instruction text plus an incremental text of N characters beyond the complete instruction; the fourth type is the complete instruction text plus an incremental text of M characters beyond the complete instruction. N and M are positive integers, and M is greater than N.
  • The text of a complete instruction may be a semantically complete text, such as "turn on the air conditioner"; the text of an incomplete instruction may be a semantically incomplete text, such as the truncated fragment "turn on the air". The incremental text is the extra text a fragment contains beyond the semantically complete text: N may be 1, for example the complete instruction followed by one extra character, and M may be 2, for example the complete instruction followed by two extra characters.
  • For example, when the first target speech text is "open the car window and turn off the air conditioner", continuous sentences may be constructed for it according to the classification rules of the preset sentence types.
  • The continuous sentences may include test sentences of the first, second, third and fourth preset types: the first-type sentence is the truncated "open the car", the second-type sentence is "open the car window", the third-type sentence is "open the car window" plus one extra character, and the fourth-type sentence is "open the car window" plus two extra characters.
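  • As a concrete illustration of this construction step, the sketch below cuts the four candidate fragments for one candidate instruction out of the target text; the helper name and the N=1, M=2 values are illustrative assumptions rather than the patent's implementation:

```python
def build_test_sentences(text: str, end: int, n: int = 1, m: int = 2):
    """Construct the four test sentences for a candidate instruction text[:end].

    Returns fragments corresponding to the four preset sentence types:
    incomplete, complete, complete + N extra characters, and complete + M
    extra characters (N=1, M=2 as in the example above).
    """
    return {
        "type1_incomplete": text[:end - 1],
        "type2_complete":   text[:end],
        "type3_plus_n":     text[:end + n],
        "type4_plus_m":     text[:end + m],
    }

# For the first target speech text "打开车窗关闭空调" ("open the car window and
# turn off the air conditioner") and a candidate instruction of length 4
# ("打开车窗"), this yields "打开车" / "打开车窗" / "打开车窗关" / "打开车窗关闭",
# matching the example in the text.
print(build_test_sentences("打开车窗关闭空调", end=4))
```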
  • the predicted classification results include the predicted probabilities corresponding to each test sentence in the preset sentence types.
  • The classification model may be a pre-trained language model from the field of natural language processing, such as a BERT (Bidirectional Encoder Representations from Transformers) model, an ELMo (Embeddings from Language Models) model, or an ALBERT (A Lite BERT) model.
  • ALBERT is an improved version of the BERT model with far fewer parameters than the traditional BERT structure, which improves training speed and model performance.
  • the improvement of ALBERT mainly lies in the factorization of embedding layer parameters, cross-layer parameter sharing mechanism, and inter-sentence continuity loss function.
  • The trained classification model is obtained by training with a large number of training sentence texts. When the continuous sentences are input into the trained classification model, it outputs the predicted probability of each test sentence in the continuous sentences for each preset sentence type.
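  • As an illustration of this inference step, the sketch below scores the test sentences with a sequence-classification model via the Hugging Face transformers library; the checkpoint path is a hypothetical placeholder and the four-label head is an assumption mirroring the four preset sentence types, not the patent's actual model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint: any ALBERT/BERT-style Chinese model fine-tuned
# with a 4-way classification head (one label per preset sentence type).
CKPT = "path/to/finetuned-albert-sentence-type"  # hypothetical

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=4)
model.eval()

test_sentences = ["打开车窗", "打开车窗关", "打开车窗关闭"]

with torch.no_grad():
    inputs = tokenizer(test_sentences, padding=True, return_tensors="pt")
    logits = model(**inputs).logits        # shape: (3 test sentences, 4 types)
    probs = torch.softmax(logits, dim=-1)  # predicted probabilities

# Each row holds the probability of one test sentence belonging to each
# of the four preset sentence types.
print(probs)
```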
  • According to the predicted probability of each test sentence included in the predicted classification result and the probability threshold corresponding to each preset sentence type, the predicted sentence type corresponding to each test sentence is determined.
  • For example, given three consecutive test sentences constructed from "open the car window and turn off the air conditioner", if the model predicts accurately, the first test sentence has the largest predicted probability of belonging to the second type of sentence, the second test sentence of belonging to the third type, and the third test sentence of belonging to the fourth type.
  • Whether a predicted classification result is credible can be determined by setting a threshold, and different thresholds can be set for different preset sentence types.
  • For example, when the preset sentence type in question is the second type, a test sentence is confirmed to be of the second type if its predicted probability of belonging to the second type exceeds the threshold. If the predicted probabilities of three test sentences belonging to the second type are 0.8, 0.5 and 0.4 respectively and the threshold is set to 0.6, only the test sentence with predicted probability 0.8 is confirmed as the second type.
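  • The threshold check can be sketched as follows; the uniform 0.6 thresholds follow the example above, and the function name and None-for-untrusted encoding are illustrative assumptions:

```python
def predict_types(probs, thresholds=(0.6, 0.6, 0.6, 0.6)):
    """Map each test sentence's probability vector to a predicted type.

    A type index (0..3) is returned only if the highest probability also
    exceeds that type's threshold; otherwise the result is None (untrusted).
    """
    results = []
    for row in probs:
        best = max(range(len(row)), key=lambda i: row[i])
        results.append(best if row[best] > thresholds[best] else None)
    return results

# Three test sentences whose probabilities of belonging to the second type
# are 0.8, 0.5 and 0.4, as in the example: only the first is accepted.
probs = [
    [0.05, 0.80, 0.10, 0.05],
    [0.30, 0.50, 0.15, 0.05],
    [0.35, 0.40, 0.15, 0.10],
]
print(predict_types(probs))  # [1, None, None]
```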
  • FIG. 2 is a schematic flow chart of another voice sentence segmentation method disclosed in the present application, which can be applied to any of the aforementioned electronic devices. As shown in Figure 2, the method includes the following steps:
  • The initial speech text is unpreprocessed text data collected from the user's utterance command; that is, it may be the text data obtained by directly converting the collected audio signal into corresponding text data.
  • The common speech templates may be several common voice commands that match the current application scenario, for example "close the car window" or "turn on the air conditioner" in a vehicle scenario, or prefix words in the user's voice command such as "please" or "I want".
  • Based on the common speech templates, the prefix and some action words of the user's utterance command can be matched, achieving a preliminary semantic understanding and preprocessing the initial speech text.
  • For example, if the common speech templates are "close the car window" and "turn on the air conditioner", and the prefix or action words of the initial speech text include "close the car window", which is exactly the same as a common template, then "close the car window" is executed directly, the four characters "close the car window" are deleted from the initial speech text, and the remaining initial speech text is determined as the first target speech text.
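  • A minimal sketch of this template-based preprocessing is shown below; the template list, prefix list and function name are illustrative assumptions rather than the patent's actual templates:

```python
COMMON_TEMPLATES = ["关闭车窗", "打开空调"]  # "close the car window", "turn on the air conditioner"
COMMON_PREFIXES = ["请", "我要"]            # "please", "I want"

def preprocess(initial_text: str):
    """Strip a leading common template or prefix from the initial speech text.

    Returns (matched_template, first_target_text). A matched template would
    be executed directly; the remainder becomes the first target speech text.
    """
    for template in COMMON_TEMPLATES:
        if initial_text.startswith(template):
            return template, initial_text[len(template):]
    for prefix in COMMON_PREFIXES:
        if initial_text.startswith(prefix):
            return None, initial_text[len(prefix):]
    return None, initial_text

# "关闭车窗打开氛围灯": execute "关闭车窗" directly and keep "打开氛围灯"
# as the first target speech text.
print(preprocess("关闭车窗打开氛围灯"))
```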
  • The speech sentence segmentation method of the present application can decompose the utterance segmentation task into two sub-tasks.
  • The first task is to predict whether the first several characters of the initial speech text can form a semantically complete sentence; the second task is to judge, based on the predicted classification results of the continuous sentences in the first target speech text, whether the continuous sentences contain a semantically complete sentence, and if so, to segment the text there.
  • For the implementation of steps 203-205, reference may be made to steps 110-130 in the foregoing embodiment; details are not repeated here.
  • If test sentences whose predicted sentence types are the second type, the third type and the fourth type appear consecutively among the test sentences included in the continuous sentences, the test sentence whose predicted type is the second type is determined as the first segmented text.
  • If test sentences whose predicted types are the first type, the second type and the third type appear consecutively in the continuous sentences, these consecutive test sentences are ignored and the check is repeated for a consecutive run of test sentences whose predicted types are the second, third and fourth types.
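  • The consecutive-type check described in these two steps can be sketched as a scan for a run of second-, third- and fourth-type predictions; the encoding of one predicted type per test sentence, in text order, is an assumption made for illustration:

```python
def find_first_segment(test_sentences, predicted_types):
    """Return the test sentence confirmed as the first segmented text.

    `predicted_types` holds one predicted type per test sentence (1-based
    type numbers as in the text). A segment is confirmed only when types
    2, 3, 4 appear consecutively; other runs (e.g. 1, 2, 3) are skipped.
    """
    for i in range(len(predicted_types) - 2):
        if predicted_types[i:i + 3] == [2, 3, 4]:
            return test_sentences[i]  # the type-2 (complete instruction) sentence
    return None

sentences = ["打开车", "打开车窗", "打开车窗关", "打开车窗关闭"]
types = [1, 2, 3, 4]
print(find_first_segment(sentences, types))  # 打开车窗
```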
  • If the first segmented text conforms to the business logic rule, the first segmented text is determined as a segmentation result of the first target speech text.
  • If the first segmented text does not conform to the business logic rule, the first segmented text is ignored and the first target speech text is not segmented.
  • The business logic rule may be that no break is performed when the last character of the first segmented text is located immediately before the word "and" in the first target speech text. For example, for "open the car window and air conditioner", if the first segmented text is "open the car window", that is, the first target speech text would be broken between "open the car window" and "and air conditioner", the first segmented text is ignored and the first target speech text is not segmented.
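  • A sketch of this business-logic check (no break immediately before "和"/"and" in the target text) is shown below; the rule encoding and function name are assumptions for illustration only:

```python
def passes_business_rule(target_text: str, first_segment: str) -> bool:
    """Reject a break point that falls immediately before "和" ("and")."""
    cut = len(first_segment)
    return not (first_segment and target_text.startswith(first_segment)
                and target_text[cut:cut + 1] == "和")

# "打开车窗和空调" ("open the car window and air conditioner"):
# breaking after "打开车窗" would split "...和空调", so the rule rejects it.
print(passes_business_rule("打开车窗和空调", "打开车窗"))    # False
print(passes_business_rule("打开车窗关闭空调", "打开车窗"))  # True
```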
  • the sliding window is the logical window in the sliding window algorithm, which generally acts on strings or arrays.
  • the algorithm can be run within a window of a certain size.
  • the window sliding process it is necessary to delete elements that slide out of the window and add elements that slide into the window.
  • The second target speech text is the remaining text content after the first segmented text is deleted from the first target speech text. After the first target speech text has been segmented and the sliding window has slid past the corresponding breakpoint mark, the first segmented text leaves the sliding window and the sliding window now starts at the beginning of the second target speech text.
  • The first target speech text may contain multiple user instructions, while the first segmented text corresponds to only one of them; for the text content of the first target speech text other than the first segmented text, steps 120-140 continue to be performed until all user instructions contained in the first target speech text have been executed.
  • Step 210 is similar to the aforementioned steps 203-209; details are not repeated here.
  • The sliding window makes full use of the classification results of the classification model, eliminates the interference of the first segmented text with the subsequent segmentation process, and focuses the segmentation on the second target speech text. For example, if the first target speech text is "close the car window, turn on the air conditioner and turn on the ambient light" and the determined first segmented text is "close the car window", then the second target speech text is "turn on the air conditioner and turn on the ambient light".
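  • The sliding-window loop can be sketched as follows; `segment_once` is a stub standing in for the candidate construction, classification and rule checks described above, so the example stays self-contained:

```python
KNOWN_COMPLETE = ("关闭车窗", "打开空调", "打开氛围灯")  # illustrative only

def segment_once(target_text: str) -> str:
    """Stand-in for one round of candidate construction, classification and
    rule checking; here it simply matches a known complete-instruction prefix."""
    for instruction in KNOWN_COMPLETE:
        if target_text.startswith(instruction):
            return instruction
    return ""

def sliding_window_segmentation(first_target_text: str):
    """Repeatedly segment, then slide the window past each breakpoint mark."""
    segments, remaining = [], first_target_text
    while remaining:
        first_segment = segment_once(remaining)
        if not first_segment:
            break
        segments.append(first_segment)
        # Slide the window behind the breakpoint mark: the confirmed segment
        # leaves the window and the remainder becomes the next target text.
        remaining = remaining[len(first_segment):]
    return segments

# "关闭车窗打开空调打开氛围灯" -> ["关闭车窗", "打开空调", "打开氛围灯"]
print(sliding_window_segmentation("关闭车窗打开空调打开氛围灯"))
```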
  • the speech segmentation method of traditional sequence labeling uses a model to predict the category of each word in a short sentence as one of the beginning, middle, end, and irrelevant.
  • the disadvantage is that the single model task is too heavy and needs to predict the category of each word.
  • The speech sentence segmentation method of the present application does not need to use a model to determine the category of each word in the first target speech text; instead, it segments the first target speech text according to the predicted sentence type corresponding to each test sentence constructed from it.
  • the use of sliding windows can achieve more optimal breakpoint selection, obtain higher fault tolerance, and achieve higher precision, recall and sentence accuracy.
  • Table 1 below compares the precision rate, recall rate and sentence segmentation accuracy of the sequence-labeling speech segmentation method and the sliding-window speech segmentation method.
  • Table 1: Example test results of the sliding-window speech segmentation method and the sequence-labeling speech segmentation method
  • FIG. 3 is a schematic flow chart of another speech segmentation method disclosed in the present application.
  • the multiple training sentence segmentation texts included in the sample data are generated according to classification rules of preset sentence segmentation types.
  • the multiple training texts may be a certain number of commonly used instructions of users manually selected according to the characteristics of different application scenarios, which can be used to describe the needs of users in the application scenarios.
  • The multiple training sentence texts may also include training texts respectively corresponding to the first type of sentence, the second type of sentence, the third type of sentence and the fourth type of sentence.
  • each training sentence text included in the sample data may correspond to a real sentence type.
  • the real sentence type may be manually marked, or it may be an accurate sentence type identified based on other classification methods, which is not specifically limited.
  • The method of selecting training sentence texts from the sample data may be random selection, or may be selection of continuous sentences according to the classification rules of the preset sentence types; this is not specifically limited.
  • the training classification results include prediction probabilities corresponding to the training sentence texts in preset sentence types.
  • the predicted sentence segmentation types corresponding to each training sentence segmentation text are determined.
  • the calculated loss may be L1 loss, L2 loss, cross-entropy loss, etc., but is not limited thereto.
  • the method for adjusting model parameters may be gradient descent method, grid search method, random search method, Bayesian optimization method, etc., but is not limited thereto.
  • the aforementioned steps 310 to 350 may be a process of training the classification model.
  • the predicted probability output by the classification model obtained after training is relatively accurate, and can be applied to scenarios such as vehicle-mounted speech recognition.
  • the following steps 360-390 are performed to segment voice commands.
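  • A hedged sketch of such a training loop (steps 310-350) is shown below, using cross-entropy loss and gradient descent as examples of the options listed above; the toy model, sample data and hyperparameters are placeholders rather than the patent's configuration:

```python
import torch
from torch import nn

# Toy stand-in for the classification model to be trained: in practice this
# would be an ALBERT/BERT-style encoder with a 4-way classification head.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # gradient descent
criterion = nn.CrossEntropyLoss()                         # cross-entropy loss

# Placeholder sample data: feature vectors for training sentence texts and
# their real sentence types (0..3, one per preset sentence type).
features = torch.randn(16, 32)
real_types = torch.randint(0, 4, (16,))

for epoch in range(5):
    logits = model(features)              # training classification result
    loss = criterion(logits, real_types)  # training loss vs. real sentence types
    optimizer.zero_grad()
    loss.backward()                       # adjust parameters according to the loss
    optimizer.step()
    print(f"epoch {epoch}: loss={loss.item():.4f}")
```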
  • FIG. 4 is a schematic structural diagram of a speech sentence segmentation device disclosed in the present application.
  • the device can be applied to electronic devices such as vehicle terminals, and is not specifically limited.
  • The speech segmentation device 400 may include: a construction module 410, an acquisition module 420, a determination module 430, and a sentence segmentation module 440.
  • The construction module 410 is configured to construct continuous sentences for the first target speech text according to the classification rules of the preset sentence types, where the multiple test sentences included in the continuous sentences are text fragments extracted from the first target speech text that respectively correspond to different preset sentence types.
  • The acquisition module 420 is configured to obtain the predicted classification result of the continuous sentences through the trained classification model, where the predicted classification result includes the predicted probability of each test sentence for each preset sentence type.
  • Determining module 430 is used for determining the predicted sentence type corresponding to each test sentence according to the prediction classification result
  • the sentence segmentation module 440 is configured to segment the first target speech text according to the predicted sentence type corresponding to each test sentence.
  • The determination module 430 may be configured to determine the predicted sentence type corresponding to each test sentence according to the predicted probability of each test sentence included in the predicted classification result and the probability threshold corresponding to each preset sentence type.
  • The preset sentence types include at least the first type, the second type, the third type and the fourth type of sentence; the first type of sentence includes the text of an incomplete instruction; the second type includes the text of a complete instruction; the third type includes the complete instruction text plus an incremental text of N characters beyond the complete instruction; the fourth type includes the complete instruction text plus an incremental text of M characters beyond the complete instruction; N and M are positive integers, and M is greater than N.
  • The multiple test sentences included in the continuous sentences correspond at least to the first type, the second type, the third type and the fourth type of sentence respectively.
  • the sentence segmentation module 440 further includes a determination unit and a sentence segmentation unit.
  • The determining unit may be configured to determine the test sentence whose predicted type is the second type as the first segmented text if test sentences whose predicted sentence types are the second type, the third type and the fourth type appear consecutively among the test sentences included in the continuous sentences.
  • The sentence segmentation unit is configured to segment the first target speech text according to the first segmented text.
  • The sentence segmentation unit may also be configured to determine the first segmented text as the segmentation result of the first target speech text if the first segmented text conforms to the business logic rule, and to ignore the first segmented text and not segment the first target speech text if it does not conform to the business logic rule.
  • The speech sentence segmentation device further includes a sliding module, configured to obtain the breakpoint mark corresponding to the segmentation of the first target speech text, slide the sliding window behind the breakpoint mark, and determine the text in the sliding window as the second target speech text; the trained classification model is then used to obtain the predicted classification result of the second target speech text, and the second target speech text is segmented using this predicted classification result.
  • The speech sentence segmentation device further includes a preprocessing module, configured to obtain the initial speech text, delete from the initial speech text the initial segmented text that is consistent with a common speech template, and determine the initial speech text after this deletion as the first target speech text.
  • The trained classification model used by the speech sentence segmentation device may also be obtained by means of the model training device 500 described below.
  • FIG. 5 is a schematic structural diagram of a model training device disclosed in the present application.
  • the model training device can be applied to electronic devices with strong computing capabilities such as servers and computers; or, the model training device can also be applied to terminal devices with weak computing capabilities such as vehicle terminals, which are not specifically limited.
  • the model training device 500 may include: an acquisition module 510 , a selection module 520 , a training module 530 , a determination module 540 , and an adjustment module 550 .
  • An acquisition module 510, configured to acquire sample data, where the multiple training sentence texts included in the sample data are generated according to the classification rules of the preset sentence types;
  • the selection module 520 is used to select the training sentence sentence text from the sample data
  • The training module 530 is configured to input the training sentence texts into the classification model to be trained to obtain the training classification results of the training sentence texts, where the training classification result includes the predicted probability of each training sentence text for each preset sentence type;
  • Determining module 540 is used to determine the training sentence type of the training sentence text according to the training classification result of the training sentence text;
  • The adjustment module 550 is configured to calculate the training loss according to the training sentence types of the training sentence texts and the real sentence types corresponding to the training sentence texts, and to adjust the parameters of the classification model to be trained according to the training loss to obtain the trained classification model.
  • FIG. 6 is a schematic structural diagram of an electronic device disclosed in the present application. As shown in FIG. 6, the electronic device 600 may include:
  • a memory 610 storing executable program code
  • processor 620 coupled to the memory 610;
  • the processor 620 invokes the executable program code stored in the memory 610 to execute any one of the voice sentence segmentation methods disclosed in this application.
  • the present application discloses a computer-readable storage medium, which stores a computer program, wherein, when the computer program is executed by the processor, the processor is made to implement any one of the speech segmentation methods disclosed in the present application.
  • The sequence numbers of the above processes do not necessarily indicate the order of execution; the execution order of the processes should be determined by their functions and internal logic, and does not constitute any limitation on the implementation process of the present application.
  • The units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • If the above integrated units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-accessible memory.
  • The technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server or a network device, and specifically a processor in the computer device) to execute some or all of the steps of the above methods in the various embodiments of the present application.
  • ROM read-only Memory
  • RAM random access memory
  • PROM programmable read-only memory
  • EPROM Erasable Programmable Read Only Memory
  • OTPROM One-time Programmable Read-Only Memory
  • EEPROM Electrically Erasable Programmable Read-Only Memory
  • CD-ROM Compact Disc Read-Only Memory

Abstract

A speech sentence segmentation method and apparatus, an electronic device, and a storage medium. The method comprises: constructing continuously segmented sentences for a first target speech text according to a classification rule of a preset segmented sentence type (110); obtaining a predicted classification result of the continuously segmented sentences by means of a trained classification model (120); determining, according to the predicted classification result, a predicted segmented sentence type respectively corresponding to each test segmented sentence (130); and performing sentence segmentation on the first target speech text according to the predicted segmented sentence type respectively corresponding to each test segmented sentence (140).

Description

语音断句方法、装置、电子设备及存储介质Speech sentence segmentation method, device, electronic equipment and storage medium
本申请要求于2022年01月04日提交国家知识产权局、申请号为202210001104.9、申请名称为“语音断句方法、装置、电子设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application with the application number 202210001104.9 and the application name "method, device, electronic equipment and storage medium for phonetic sentence segmentation" submitted to the State Intellectual Property Office on January 04, 2022, the entire content of which is incorporated by reference incorporated in this application.
技术领域technical field
本申请涉及自然语言处理技术领域,具体涉及一种语音断句方法、装置、电子设备及存储介质。The present application relates to the technical field of natural language processing, and in particular to a speech sentence segmentation method, device, electronic equipment and storage medium.
背景技术Background technique
现有的语音断句方法主要通过人工智能领域的自然语言处理技术,应用于多个应用场景下的智能对话系统。比如在车载应用场景下的智能对话系统中,利用语音断句方法可以识别用户语音命令中包含的多个独立指令,从而对语音命令进行断句,以便合理地执行各个独立指令。但是,在实践中发现,现有的自然语言处理常用模型在语音断句时需要预测用户语音命令中每一个字的类别,存在模型任务繁重的问题。Existing speech segmentation methods are mainly applied to intelligent dialogue systems in multiple application scenarios through natural language processing technology in the field of artificial intelligence. For example, in an intelligent dialogue system in a vehicle application scenario, the voice segmentation method can be used to identify multiple independent instructions contained in the user's voice command, so as to segment the voice command so as to reasonably execute each independent instruction. However, in practice, it is found that the existing common models for natural language processing need to predict the category of each word in the user's voice command when the speech is segmented, and there is a problem of heavy model tasks.
发明内容Contents of the invention
本申请第一方面提供一种语音断句方法,包括:The first aspect of the present application provides a method for speech segmentation, including:
根据预设断句类型的分类规则，为第一目标语音文本构造连续断句，所述连续断句包括的多个测试断句是从第一目标语音文本中截取出来，且与不同的预设断句类型分别对应的文本片段；According to the classification rules of the preset sentence types, continuous sentences are constructed for the first target speech text, where the multiple test sentences included in the continuous sentences are text fragments extracted from the first target speech text that respectively correspond to different preset sentence types;
通过训练完成的分类模型获取所述连续断句的预测分类结果;所述预测分类结果包括各个所述测试断句在所述预设断句类型中对应的预测概率;Obtaining the predicted classification results of the continuous sentences through the trained classification model; the predicted classification results include the corresponding predicted probabilities of each of the test sentences in the preset sentence types;
根据所述预测分类结果确定与各个所述测试断句分别对应的预测断句类型;Determining the predicted sentence types corresponding to each of the test sentences according to the predicted classification results;
根据各个所述测试断句分别对应的预测断句类型对所述第一目标语音文本进行断句。The first target speech text is segmented according to the predicted sentence type corresponding to each test sentence.
本申请第二方面提供一种语音断句装置,包括:The second aspect of the present application provides a speech sentence segmentation device, including:
构造模块，用于根据预设断句类型的分类规则，对第一目标语音文本构造连续断句，所述连续断句包括的多个测试断句是从第一目标语音文本中截取出来的与不同的预设断句类型分别对应的文本片段；A construction module, configured to construct continuous sentences for the first target speech text according to the classification rules of the preset sentence types, where the multiple test sentences included in the continuous sentences are text fragments extracted from the first target speech text that respectively correspond to different preset sentence types;
获取模块,用于通过训练完成的分类模型获取所述测试断句的预测分类结果;所述预测分类结果包括所述测试断句在所述预设断句类型中对应的预测概率;An acquisition module, configured to acquire the predicted classification result of the test sentence through the trained classification model; the predicted classification result includes the predicted probability corresponding to the test sentence in the preset sentence type;
确定模块,用于根据所述预测分类结果确定与所述测试断句对应的预测断句类型;A determining module, configured to determine a predicted sentence type corresponding to the test sentence according to the predicted classification result;
断句模块,用于根据各个所述测试断句对应的预测断句类型对所述第一目标语音文本进行断句。The sentence segmentation module is configured to segment the first target speech text according to the predicted sentence type corresponding to each of the test sentences.
本申请第三方面提供一种电子设备,包括存储器及处理器,所述存储器中存储有计算机程序,所述计算机程序被所述处理器执行时,使得所述处理器实现本申请公开的任意一种语音断句方法。The third aspect of the present application provides an electronic device, including a memory and a processor, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the processor can implement any one of the methods disclosed in the present application. A method of phonetic sentence segmentation.
本申请第四方面提供一种算机可读存储介质,其存储计算机程序,其中,所述计算机程序使得计算机执行本申请公开的任意一种语音断句方法。The fourth aspect of the present application provides a computer-readable storage medium, which stores a computer program, wherein the computer program enables the computer to execute any one of the speech segmentation methods disclosed in the present application.
依据本申请提供的语音断句方法，根据预设断句类型的分类规则，从第一目标语音文本中截取出与不同的预设断句类型分别对应的多个文本片段，将多个文本片段作为多个测试断句；将多个测试断句输入到训练完成的分类模型中，从训练完成的分类模型中输出测试断句在各个预设断句类型中对应的预测概率；根据测试断句在各个预设断句类型中对应的预测概率，在确定与测试断句对应的预测断句类型之后，对第一目标语音文本进行断句。本申请中的语音断句方法能够将用户语音命令分类成多个语音断句，并确认当前语音断句的句子类型是否是个完整的独立断句，无需将训练模型用于检测用户语音命令中每个字的类别，简化了模型结构，提高了语音断句的准确率。According to the speech sentence segmentation method provided by this application, multiple text fragments respectively corresponding to different preset sentence types are extracted from the first target speech text according to the classification rules of the preset sentence types, and these text fragments are used as multiple test sentences; the test sentences are input into the trained classification model, which outputs the predicted probability of each test sentence for each preset sentence type; and after the predicted sentence type corresponding to each test sentence is determined from these probabilities, the first target speech text is segmented. The speech segmentation method in this application can divide the user's voice command into multiple speech sentences and confirm whether the sentence type of the current speech sentence is a complete independent sentence, without using the trained model to detect the category of each word in the user's voice command, which simplifies the model structure and improves the accuracy of speech segmentation.
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,并不能限制本申请。It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
附图说明Description of drawings
通过结合附图对本申请示例性实施方式进行更详细的描述,本申请的上述以及其它目的、特征和优势将变得更加明显,其中,在本申请示例性 实施方式中,相同的参考标号通常代表相同部件。The above and other objects, features and advantages of the present application will become more apparent by describing the exemplary embodiments of the present application in more detail with reference to the accompanying drawings, wherein, in the exemplary embodiments of the present application, the same reference numerals generally represent same parts.
图1是本申请公开的一种语音断句方法的流程示意图;Fig. 1 is the schematic flow chart of a kind of speech segmentation method disclosed in the present application;
图2是本申请公开的另一种语音断句方法的流程示意图;Fig. 2 is the schematic flow chart of another kind of voice sentence segmentation method disclosed in the present application;
图3是本申请公开的另一种语音断句方法的流程示意图;Fig. 3 is the schematic flow chart of another kind of phonetic punctuation method disclosed in the present application;
图4是本申请公开的一种语音断句装置的结构示意图;Fig. 4 is a schematic structural diagram of a speech sentence segmentation device disclosed in the present application;
图5是本申请公开的一种模型训练装置的结构示意图;Fig. 5 is a schematic structural diagram of a model training device disclosed in the present application;
图6是本申请公开的一种电子设备的结构示意图。FIG. 6 is a schematic structural diagram of an electronic device disclosed in the present application.
具体实施方式Detailed ways
下面将参照附图更详细地描述本申请的实施方式。虽然附图中显示了本申请的实施方式,然而应该理解,可以以各种形式实现本申请而不应被这里阐述的实施方式所限制。相反,提供这些实施方式是为了使本申请更加透彻和完整,并且能够将本申请的范围完整地传达给本领域的技术人员。Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. Although embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this application will be thorough and complete, and will fully convey the scope of this application to those skilled in the art.
在本申请使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本申请。在本申请和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。The terminology used in this application is for the purpose of describing particular embodiments only, and is not intended to limit the application. As used in this application and the appended claims, the singular forms "a", "the", and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
应当理解,尽管在本申请可能采用术语“第一”、“第二”、“第三”等来描述各种信息,但这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如,在不脱离本申请范围的情况下,第一信息也可以被称为第二信息,类似地,第二信息也可以被称为第一信息。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。在本申请的描述中,“多个”的含义是两个或两个以上,除非另有明确具体的限定。It should be understood that although the terms "first", "second", "third" and so on may be used in this application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information, and similarly, second information may also be called first information. Thus, a feature defined as "first" and "second" may explicitly or implicitly include one or more of these features. In the description of the present application, "plurality" means two or more, unless otherwise specifically defined.
本申请公开了一种语音断句方法、装置、电子设备及存储介质,提高了语音断句的准确率,以下分别进行详细说明。The application discloses a phonetic sentence segmentation method, device, electronic equipment and storage medium, which improves the accuracy of phonetic sentence segmentation, and will be described in detail below.
以下结合附图详细描述本申请的技术方案。The technical solution of the present application will be described in detail below in conjunction with the accompanying drawings.
请参阅图1,图1是本申请公开的一种语音断句方法的流程示意图。该方法可应用于各种智能终端,如智能手机、智能家居、可穿戴设备、车载终端 等电子设备,具体不做限定。该方法的使用场景可以是智能家居、车载语音、智能客服、医疗场景、工业场景等行业和场景。Please refer to FIG. 1 . FIG. 1 is a schematic flow chart of a speech segmentation method disclosed in the present application. This method can be applied to various smart terminals, such as smart phones, smart homes, wearable devices, vehicle-mounted terminals and other electronic devices, which are not specifically limited. The usage scenarios of this method can be industries and scenarios such as smart home, vehicle voice, intelligent customer service, medical scenarios, and industrial scenarios.
例如,在车载语音场景中,当用户需要通过车载智能对话系统发出语音命令来操控系统时,当语音命令中包含多个独立指令时,比如“打开空调关闭车窗氛围灯打开播放音乐”,需要对语音命令进行断句,从而合理地执行用户语音命令中包含的各个独立指令。For example, in the car voice scene, when the user needs to issue a voice command to control the system through the car intelligent dialogue system, when the voice command contains multiple independent instructions, such as "turn on the air conditioner, close the window, and turn on the ambient light to play music", it is necessary to Segment the voice command so that each independent instruction contained in the user voice command can be reasonably executed.
如图1所示,该方法包括以下步骤:As shown in Figure 1, the method includes the following steps:
110、根据预设断句类型的分类规则,为第一目标语音文本构造连续断句。110. Construct continuous sentences for the first target speech text according to the classification rules of preset sentence types.
连续断句包括的多个测试断句是从第一目标语音文本中截取出来,且与不同的预设断句类型分别对应的文本片段。The multiple test sentences included in the continuous sentence are text segments that are extracted from the first target speech text and respectively correspond to different preset sentence types.
第一目标语音文本是通过识别和理解过程把用户话语命令中的音频信号转变为对应的文本数据后,进行文本预处理生成的。The first target speech text is generated by performing text preprocessing after converting the audio signal in the user's speech command into corresponding text data through the process of recognition and understanding.
其中，第一目标语音文本针对不同的应用场景，具有相应应用场景的特点，例如，针对车载语音场景，得到的第一目标语音文本可以包括：打开车窗打开音乐打开灯光等；针对智能家居场景，比如在智能音响的使用场景下，得到的第一目标语音文本可以包括：搜索热门歌曲调高音量切换下一首等。The first target speech text has the characteristics of the corresponding application scenario. For example, for a vehicle speech scenario the obtained first target speech text may include: open the car window, play music, turn on the lights, etc.; for a smart home scenario, such as a smart speaker, the obtained first target speech text may include: search for popular songs, turn up the volume, switch to the next song, etc.
预设断句类型至少包括第一类断句,第二类断句,第三类断句,第四类断句;第一类断句包括非完整指令的文本,第二类断句包括完整指令的文本,第三类断句包括完整指令文本以及除了完整指令的文本以外的N个字的增量文本,第四类断句包括完整指令文本以及除了完整指令的文本以外的M个字的增量文本;N和M为正整数,M大于N。The preset sentence types include at least the first type of sentence, the second type of sentence, the third type of sentence, and the fourth type of sentence; the first type of sentence includes the text of incomplete instructions, the second type of sentence includes the text of complete instructions, and the third type The sentence includes the complete instruction text and the incremental text of N words except the text of the complete instruction, and the fourth type of sentence includes the complete instruction text and the incremental text of M words except the text of the complete instruction; N and M are positive Integer, M greater than N.
其中,对于第一类断句,完整指令的文本可以是语义上完整的文本,比如“打开空调”;对于第二类断句,非完整指令的文本可以是语义上不完整的文本,比如“打开空”;对于第三类断句,增量文本可以是一个句子中除了语义上完整的文本外还包括的多余的文本,N可以取1,比如“打开空调关”;对于第四类断句,M可以取2,比如“打开空调关闭”。Among them, for the first type of sentence, the text of the complete instruction can be a semantically complete text, such as "turn on the air conditioner"; for the second type of sentence, the text of the incomplete instruction can be a semantically incomplete text, such as "open the air conditioner". "; for the third type of sentence sentence, the incremental text can be redundant text included in a sentence in addition to the semantically complete text, and N can be 1, such as "turn on the air conditioner"; for the fourth type of sentence sentence, M can be Take 2, for example, "turn on the air conditioner and turn it off".
预设断句类型的分类规则包括将第一目标语音文本至少分成第一类断句,第二类断句,第三类断句,第四类断句。The classification rules of the preset sentence types include at least classifying the first target speech text into the first type of sentence, the second type of sentence, the third type of sentence, and the fourth type of sentence.
例如，当第一目标语音文本为"打开车窗关闭空调"时，可根据预设断句类型的分类规则对第一目标语音文本构造连续断句。连续断句可以包括预设断句类型为第一类断句、第二类断句、第三类断句、第四类断句的多个测试断句；第一类断句是"打开车"，第二类断句是"打开车窗"，第三类断句是"打开车窗关"，第四类断句是"打开车窗关闭"。For example, when the first target speech text is "打开车窗关闭空调" ("open the car window and turn off the air conditioner"), continuous sentences may be constructed for it according to the classification rules of the preset sentence types. The continuous sentences may include test sentences of the first, second, third and fourth preset types: the first-type sentence is "打开车" (the truncated "open the car"), the second-type sentence is "打开车窗" ("open the car window"), the third-type sentence is "打开车窗关" (the complete instruction plus one extra character), and the fourth-type sentence is "打开车窗关闭" (the complete instruction plus two extra characters).
120、通过训练完成的分类模型获取连续断句的预测分类结果。120. Obtain the prediction and classification results of consecutive sentence segments through the trained classification model.
预测分类结果包括各个测试断句在预设断句类型中对应的预测概率。The predicted classification results include the predicted probabilities corresponding to each test sentence in the preset sentence types.
分类模型,可以是自然语言处理领域中的预训练语言模型,即BERT(Bidirectional Encoder Representations from Transformers)模型,ELMo(Embedding from Language Models)模型,或者ALBERT(A LITE BERT)模型中的任一项。其中,ALBERT是BERT模型的改进版本,参数量远远少于传统的BERT模型结构,提升了训练速度和模型性能。ALBERT的改进主要在于嵌入层参数因式分解、跨层参数共享机制、句间连续性损失函数。The classification model can be a pre-trained language model in the field of natural language processing, namely BERT (Bidirectional Encoder Representations from Transformers) model, ELMo (Embedding from Language Models) model, or any of the ALBERT (ALITE BERT) models. Among them, ALBERT is an improved version of the BERT model, with far fewer parameters than the traditional BERT model structure, which improves the training speed and model performance. The improvement of ALBERT mainly lies in the factorization of embedding layer parameters, cross-layer parameter sharing mechanism, and inter-sentence continuity loss function.
训练完成的分类模型是利用大量的训练断句文本进行训练后得到的,向训练完成的分类模型输入连续断句,可以输出连续断句中各个测试断句在预设断句类型中对应的预测概率。The trained classification model is obtained by using a large number of training sentence texts for training. Continuous sentence input is input to the trained classification model, and the corresponding prediction probability of each test sentence in the continuous sentence in the preset sentence type can be output.
130、根据预测分类结果确定与各个测试断句分别对应的预测断句类型。130. Determine the predicted sentence type corresponding to each test sentence according to the predicted classification result.
根据预测分类结果包括的各个测试断句分别对应的预测概率，以及各个预设断句类型分别对应的概率阈值，确定各个测试断句中每个测试断句对应的预测断句类型。According to the predicted probability of each test sentence included in the predicted classification result and the probability threshold corresponding to each preset sentence type, the predicted sentence type corresponding to each test sentence is determined.
例如,给定一个连续的三个测试断句,“打开车窗”、“打开车窗关”、“打开车窗关闭”,获得每个测试断句在预设断句类型中对应的预测概率,如果模型预测效果准确的话,则第一个测试断句属于第二类断句的预测概率最大,第二个测试断句属于第三类断句的预测概率最大,第三个测试断句属于第四类断句的预测概率最大。通过设置阈值可以确定预测分类结果是否可信。对于不同的预设断句类型,可以设置不同的阈值。例如,针对测试断句对应的预设断句类型是第二类断句的情况,若测试断句属于第二类断句的预测概率大于阈值时,就确认该测试断句为第二类断句。若三个测试断句属于第二类断句的预测概率分别是0.8、0.5、0.4,当阈值设置为0.6时,则预测概率为0.8的测试断句属于第二类断句。For example, given a continuous three test sentences, "open the car window", "open the car window to close", "open the car window to close", obtain the corresponding predicted probability of each test sentence in the preset sentence type, if the model If the prediction effect is accurate, the prediction probability of the first test sentence belonging to the second type of sentence is the largest, the prediction probability of the second test sentence belonging to the third type of sentence is the largest, and the prediction probability of the third test sentence belonging to the fourth type of sentence is the largest . Whether the predicted classification result is credible can be determined by setting a threshold. For different preset sentence segmentation types, different thresholds can be set. For example, in the case that the preset sentence type corresponding to the test sentence is the second type of sentence, if the predicted probability that the test sentence belongs to the second type of sentence is greater than the threshold, the test sentence is confirmed to be the second type of sentence. If the predicted probabilities of the three test sentences belonging to the second type of sentence are 0.8, 0.5, and 0.4 respectively, when the threshold is set to 0.6, the test sentence with a predicted probability of 0.8 belongs to the second type of sentence.
140、根据各个测试断句分别对应的预测断句类型对第一目标语音文本进行断句。140. Segment the first target speech text according to the predicted sentence type corresponding to each test sentence.
根据预设断句类型的分类规则对第一目标语音文本构造多个测试断句,使得分类模型能够对测试断句执行分类任务,可以以多个字组成的句子作为预测的单元,而不是以单独一个字作为预测的单元,很大程度上简化了模型的结构,减轻了模型的预测任务。Construct multiple test sentences for the first target speech text according to the classification rules of the preset sentence types, so that the classification model can perform classification tasks on the test sentences, and a sentence composed of multiple characters can be used as a prediction unit instead of a single character As a unit of prediction, it greatly simplifies the structure of the model and reduces the prediction task of the model.
请参阅图2,图2是本申请公开的另一种语音断句方法的流程示意图,该方法可应用于前述的任意一种电子设备。如图2所示,该方法包括以下步骤:Please refer to FIG. 2 . FIG. 2 is a schematic flow chart of another voice sentence segmentation method disclosed in the present application, which can be applied to any of the aforementioned electronic devices. As shown in Figure 2, the method includes the following steps:
201、获取初始语音文本。201. Acquire an initial speech text.
初始语音文本是从用户话语命令中采集的未经预处理的文本数据。即，可以是直接从采集到的音频信号转变为对应的文本数据后得到的文本数据。The initial speech text is unpreprocessed text data collected from the user's utterance command; that is, it may be the text data obtained by directly converting the collected audio signal into corresponding text data.
202、将初始语音文本中与常用语音模板一致的初始断句文本从初始语音文本中删除,将删除了初始断句文本之后的初始语音文本确定为第一目标语音文本。202. Delete the initial sentence text in the initial speech text that is consistent with the common speech template from the initial speech text, and determine the initial speech text after the deletion of the initial sentence text as the first target speech text.
常用语音模板可以是与当前应用场景相匹配的多个常用语音命令,例如,在车载场景下,可以是“关闭车窗”、“打开空调”等,或者用户话语命令中的前缀词,比如“请”、“我要”等。基于常用语音模板,可以对用户的话语命令进行前缀以及部分动作词的适配,达到初步的语义理解,实现对初始语音文本的数据预处理。例如,常用语音模板是“关闭车窗”、“打开空调”,如果初始语音文本的前缀词或者部分动作词包括“关闭车窗”,正好与常用语音模板相同,则直接执行“关闭车窗”,并从初始语音文本中删除“关闭车窗”这四个字,将删除了这四个字之后的初始语音文本确定为第一目标语音文本。Commonly used voice templates can be multiple commonly used voice commands that match the current application scenario. For example, in a vehicle scenario, it can be "close the window", "turn on the air conditioner", etc., or a prefix word in the user's voice command, such as " Please", "I want" and so on. Based on the commonly used voice templates, the user's utterance commands can be prefixed and some action words can be adapted to achieve a preliminary semantic understanding and realize data preprocessing of the initial voice text. For example, the commonly used speech templates are "close the car window" and "turn on the air conditioner". If the prefix words or some action words of the initial speech text include "close the car window", which happens to be the same as the commonly used speech template, then directly execute "close the car window" , and delete the four words "close the car window" from the initial speech text, and determine the initial speech text after deleting these four words as the first target speech text.
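As a rough, non-limiting illustration of this preprocessing (the template list and the simple prefix-matching strategy below are assumptions; the embodiment only requires that text consistent with a common speech template be removed):

    # Sketch: strip leading fragments that match a common speech template or prefix word.
    COMMON_TEMPLATES = ["关闭车窗", "打开空调"]   # assumed in-vehicle command templates
    PREFIX_WORDS = ["请", "我要"]                 # assumed polite/filler prefixes

    def preprocess(initial_text):
        """Return (first_target_text, commands_matched_by_template)."""
        executed = []
        changed = True
        while changed:
            changed = False
            for pattern in PREFIX_WORDS + COMMON_TEMPLATES:
                if initial_text.startswith(pattern):
                    if pattern in COMMON_TEMPLATES:
                        executed.append(pattern)      # template command can be executed directly
                    initial_text = initial_text[len(pattern):]
                    changed = True
        return initial_text, executed

    print(preprocess("请打开车窗和空调"))
    # ('打开车窗和空调', [])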
Adapting the prefix and some action words of the user's spoken command based on templates reduces disturbance to the model. The speech sentence segmentation method of the present application splits the segmentation task into two subtasks: the first is to predict whether the first several characters of the initial speech text form a semantically complete sentence; the second is to judge, from the predicted classification results of the consecutive test sentences in the first target speech text, whether a semantically complete sentence exists among them, and if so, to perform segmentation.
203. Construct consecutive sentences for the first target speech text according to the classification rules of the preset sentence types.
204. Obtain the predicted classification result of the consecutive sentences through the trained classification model.
205. Determine the predicted sentence type of each test sentence according to the prediction probabilities of the test sentences included in the predicted classification result and the probability thresholds corresponding to the preset sentence types.
For the implementation of steps 203-205, reference may be made to steps 110-130 in the foregoing embodiment, and details are not repeated here.
206. If test sentences whose predicted sentence types are the second type, the third type, and the fourth type appear consecutively among the test sentences included in the consecutive sentences, determine the test sentence whose predicted sentence type is the second type as the first segmented text.
For example, if "open the car window", "open the car window, clo-", and "open the car window, close", whose predicted sentence types are the second, third, and fourth types respectively, appear consecutively among the test sentences, then "open the car window", whose predicted sentence type is the second type, can be determined as the first segmented text.
Exemplarily, if test sentences whose predicted sentence types are the first, second, and third types appear consecutively in the consecutive sentences, those consecutive test sentences are ignored, and it is detected again whether test sentences of the second, third, and fourth types appear consecutively.
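As an illustrative, non-limiting sketch of this type-2/type-3/type-4 pattern check (integer labels stand in for the preset sentence types):

    # Sketch: find a run of predicted types 2, 3, 4 over consecutive test sentences and
    # return the type-2 fragment as the first segmented text.
    def find_first_segment(fragments, types):
        for i in range(len(types) - 2):
            if types[i:i + 3] == [2, 3, 4]:
                return fragments[i]          # the type-2 fragment is a complete instruction
        return None                          # no credible segmentation point found

    frags = ["打开车窗", "打开车窗关", "打开车窗关闭"]
    print(find_first_segment(frags, [2, 3, 4]))   # 打开车窗
    print(find_first_segment(frags, [1, 2, 3]))   # None -> keep scanning, do not segment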
207. If the first segmented text conforms to the business logic rules, determine the first segmented text as the segmentation result of the first target speech text.
208. If the first segmented text does not conform to the business logic rules, ignore the first segmented text and do not segment the first target speech text.
After the first segmented text is obtained, whether the current segmentation is reasonable can be confirmed according to business logic rules. Exemplarily, a business logic rule may be that no break is performed when the last character of the first segmented text is located immediately before the character "和" (and) in the first target speech text. For example, for "打开车窗和空调" (open the car window and the air conditioner), if the first segmented text is "open the car window", meaning the first target speech text would be broken between "open the car window" and "and the air conditioner", the first segmented text is ignored and the first target speech text is not segmented.
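Only the single "和" (and) rule from the example is sketched below; an actual system would presumably maintain a broader, configurable set of business logic rules:

    # Sketch: reject a segmentation point that falls immediately before "和" ("and"),
    # since the user is most likely still listing coordinated objects.
    def passes_business_rules(target_text, first_segment):
        breakpoint_pos = len(first_segment)           # the segment is a prefix of the target text
        if target_text[breakpoint_pos:breakpoint_pos + 1] == "和":
            return False                              # e.g. "打开车窗 | 和空调" -> do not break
        return True

    print(passes_business_rules("打开车窗和空调", "打开车窗"))      # False -> ignore the segment
    print(passes_business_rules("打开车窗关闭空调", "打开车窗"))    # True  -> accept the segment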
209. Obtain the breakpoint mark corresponding to the segmentation of the first target speech text, slide the sliding window to behind the breakpoint mark, and determine the text inside the sliding window as the second target speech text.
A sliding window is the logical window in a sliding-window algorithm and generally operates on a string or an array. By continuously sliding the window, an algorithm can be run within a window of a given size. Before and after each slide, the middle elements do not change; only the leading and trailing elements change. In other words, the sum of the elements in the next window equals the sum of the elements in the previous window, minus the value of the element that leaves the window, plus the value of the element that newly enters it. During sliding, the elements that slide out of the window are removed and the elements that slide into it are added.
The second target speech text is the text remaining after the first segmented text is removed from the first target speech text. After the first target speech text is segmented and the sliding window is slid behind the corresponding breakpoint mark, the first segmented text leaves the sliding window, and the window now starts at the second target speech text.
Since the first target speech text may contain multiple user instructions while the first segmented text corresponds to only one of them, steps 120-140 continue to be performed on the remaining text of the first target speech text until all user instructions contained in the first target speech text have been executed.
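The breakpoint mark and sliding-window step, combined with the per-segment classification, may be sketched as follows (the stand-in classifier and helper names are assumptions for illustration, not the trained model of the embodiment):

    # Sketch: after a segment is accepted, advance the window past the breakpoint mark
    # and run the same classify-and-segment procedure on the remaining text.
    def segment_stream(target_text, classify_and_segment):
        """classify_and_segment(text) -> accepted segment (a prefix of text) or None."""
        results = []
        remaining = target_text
        while remaining:
            segment = classify_and_segment(remaining)
            if segment is None:
                break                                  # no further complete instruction found
            results.append(segment)
            remaining = remaining[len(segment):]       # slide the window past the breakpoint
        return results

    # Toy stand-in for the trained classifier: recognises three known instructions.
    def toy_classifier(text):
        for known in ("关闭车窗", "打开空调", "氛围灯打开"):
            if text.startswith(known):
                return known
        return None

    print(segment_stream("关闭车窗打开空调氛围灯打开", toy_classifier))
    # ['关闭车窗', '打开空调', '氛围灯打开']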
210. Obtain the predicted classification result of the second target speech text using the trained classification model, and segment the second target speech text using that result.
The implementation of step 210 is similar to the foregoing steps 203-209, and details are not repeated here.
The sliding window builds on the effect of the classification model and makes full use of its classification results; it removes the interference of the first segmented text with the subsequent segmentation process and focuses the segmentation on the second target speech text. For example, if the first target speech text is "关闭车窗打开空调氛围灯打开" (close the car window, turn on the air conditioner, ambient light on) and the determined first segmented text is "close the car window", then the second target speech text is "turn on the air conditioner, ambient light on".
A traditional sequence-labeling speech segmentation method uses a model to predict, for each character of a short sentence, whether its category is beginning, middle, end, or irrelevant. Its drawback is that the single model's task is too heavy, since the category of every character must be predicted.
The speech sentence segmentation method of the present application does not need a model to determine the category of each character in the first target speech text; instead, it segments the first target speech text according to the predicted sentence type of each test sentence. Meanwhile, the sliding window enables better breakpoint selection, provides higher fault tolerance, and achieves higher precision, recall, and sentence accuracy.
Table 1 below shows a comparison of precision, recall, and sentence accuracy between the sequence-labeling speech segmentation method and the sliding-window speech segmentation method.
                                              Precision    Recall    Sentence accuracy
    Sliding-window speech segmentation        97.83%       93.65%    87.8%
    Sequence-labeling speech segmentation 2   91.5%        89.1%     84%
    Sequence-labeling speech segmentation 1   89.5%        90%       78%
Table 1: Example test results of the sliding-window and sequence-labeling speech segmentation methods
Referring to FIG. 3, FIG. 3 is a schematic flowchart of another speech sentence segmentation method disclosed in the present application.
310. Obtain sample data.
The multiple training sentence texts included in the sample data are generated according to the classification rules of the preset sentence types.
The multiple training sentence texts may be a certain number of manually selected common user instructions, chosen according to the characteristics of different application scenarios, that describe user needs in those scenarios.
Optionally, if the preset sentence types include at least the first, second, third, and fourth types of sentence, the multiple training sentence texts may also include training sentence texts corresponding to each of the first, second, third, and fourth types.
In addition, each training sentence text included in the sample data may correspond to a true sentence type. The true sentence type may be manually labeled, or it may be an accurate sentence type identified by another classification method, which is not specifically limited.
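One possible, non-limiting way to derive such labeled samples from a list of complete instructions is sketched below; the instruction list and the increments N=1 and M=2 are assumptions, and in practice the samples may instead be labeled manually as noted above:

    # Sketch: derive training sentence texts and their true types from a complete instruction.
    # Type 1: incomplete prefix; type 2: complete instruction; type 3: +N chars; type 4: +M chars.
    def make_samples(instruction, followup, n=1, m=2):
        samples = [(instruction[:i], 1) for i in range(1, len(instruction))]  # incomplete prefixes
        samples.append((instruction, 2))                                      # complete instruction
        samples.append((instruction + followup[:n], 3))                       # N extra characters
        samples.append((instruction + followup[:m], 4))                       # M extra characters
        return samples

    for text, label in make_samples("打开车窗", "关闭空调"):
        print(label, text)
    # 1 打 / 1 打开 / 1 打开车 / 2 打开车窗 / 3 打开车窗关 / 4 打开车窗关闭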
320. Select training sentence texts from the sample data.
The training sentence texts may be selected from the sample data randomly, or as consecutive sentences selected according to the classification rules of the preset sentence types, which is not specifically limited.
330. Input the training sentence texts into the classification model to be trained to obtain the training classification results of the training sentence texts.
A training classification result includes the predicted probability of the training sentence text for each preset sentence type.
340. Determine the training sentence type of each training sentence text according to its training classification result.
The predicted sentence type of each training sentence text is determined according to the predicted probabilities of the training sentence texts included in the training classification results and the probability thresholds corresponding to the preset sentence types.
350. Calculate the training loss according to the training sentence types of the training sentence texts and the true sentence types corresponding to the training sentence texts, and adjust the parameters of the classification model to be trained according to the training loss, to obtain a trained classification model.
The calculated loss may be, but is not limited to, an L1 loss, an L2 loss, or a cross-entropy loss.
The model parameters may be adjusted by, but not limited to, gradient descent, grid search, random search, or Bayesian optimization.
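A non-limiting training-loop sketch is given below using a cross-entropy loss and gradient descent; the tiny bag-of-characters model, the toy samples, and the use of PyTorch are assumptions for illustration, since the embodiment does not fix a model architecture or framework:

    # Sketch: train a toy 4-way sentence-type classifier with cross-entropy loss and SGD.
    import torch
    import torch.nn as nn

    samples = [("打开车窗", 2), ("打开车窗关", 3), ("打开车窗关闭", 4), ("打开车", 1)]
    vocab = {ch: i for i, ch in enumerate(sorted({c for text, _ in samples for c in text}))}

    def encode(text):
        vec = torch.zeros(len(vocab))
        for ch in text:
            vec[vocab[ch]] += 1.0               # simple bag-of-characters features (assumed)
        return vec

    model = nn.Linear(len(vocab), 4)            # logits for the four preset sentence types
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(200):
        x = torch.stack([encode(text) for text, _ in samples])
        y = torch.tensor([label - 1 for _, label in samples])   # classes 0..3
        loss = criterion(model(x), y)            # training loss from predicted vs. true types
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    probs = torch.softmax(model(encode("打开车窗")), dim=-1)
    print(probs)                                  # predicted probabilities over the four types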
The foregoing steps 310-350 may constitute the process of training the classification model. The predicted probabilities output by the trained classification model are relatively accurate, and the model can be applied to scenarios such as in-vehicle speech recognition by performing the following steps 360-390 to segment voice commands.
360. Construct consecutive sentences for the first target speech text according to the classification rules of the preset sentence types.
370. Obtain the predicted classification result of the consecutive sentences through the trained classification model.
380. Determine the predicted sentence type of each test sentence according to the predicted classification result.
390. Segment the first target speech text according to the predicted sentence type of each test sentence.
Referring to FIG. 4, FIG. 4 is a schematic structural diagram of a speech sentence segmentation apparatus disclosed in the present application. The apparatus can be applied to electronic devices such as in-vehicle terminals, which is not specifically limited. As shown in FIG. 4, the speech sentence segmentation apparatus 400 may include: a construction module 410, an acquisition module 420, a determination module 430, and a sentence segmentation module 440.
The construction module 410 is configured to construct consecutive sentences for the first target speech text according to the classification rules of the preset sentence types, wherein the multiple test sentences included in the consecutive sentences are text fragments extracted from the first target speech text that correspond to different preset sentence types.
The acquisition module 420 is configured to obtain the predicted classification result of the consecutive sentences through the trained classification model, wherein the predicted classification result includes the predicted probability of each test sentence for each preset sentence type.
The determination module 430 is configured to determine the predicted sentence type corresponding to each test sentence according to the predicted classification result.
The sentence segmentation module 440 is configured to segment the first target speech text according to the predicted sentence type corresponding to each test sentence.
In one embodiment, the determination module 430 may be configured to determine the predicted sentence type of each test sentence according to the predicted probabilities of the test sentences included in the predicted classification result and the probability thresholds corresponding to the preset sentence types.
In one embodiment, the preset sentence types include at least a first type, a second type, a third type, and a fourth type of sentence; the first type includes text of an incomplete instruction, the second type includes text of a complete instruction, the third type includes the complete instruction text plus an increment of N characters beyond it, and the fourth type includes the complete instruction text plus an increment of M characters beyond it; N and M are positive integers, and M is greater than N.
The multiple test sentences included in the consecutive sentences correspond at least to the first type, the second type, the third type, and the fourth type, respectively.
In one embodiment, the sentence segmentation module 440 further includes a determination unit and a segmentation unit.
The determination unit may be configured to, if test sentences whose predicted sentence types are the second type, the third type, and the fourth type appear consecutively among the test sentences included in the consecutive sentences, determine the test sentence whose predicted sentence type is the second type as the first segmented text.
The segmentation unit may be configured to segment the first target speech text according to the first segmented text.
In one embodiment, the segmentation unit may further be configured to: if the first segmented text conforms to the business logic rules, determine the first segmented text as the segmentation result of the first target speech text; and if the first segmented text does not conform to the business logic rules, ignore the first segmented text and not segment the first target speech text.
In one embodiment, the speech sentence segmentation apparatus further includes a sliding module configured to obtain the breakpoint mark corresponding to the segmentation of the first target speech text, slide the sliding window to behind the breakpoint mark, and determine the text inside the sliding window as the second target speech text; and to obtain the predicted classification result of the second target speech text using the trained classification model and segment the second target speech text using that result.
In one embodiment, the speech sentence segmentation apparatus further includes a preprocessing module configured to acquire an initial speech text, delete from the initial speech text the initial sentence text that is consistent with a common speech template, and determine the initial speech text remaining after the deletion as the first target speech text.
In one embodiment, the speech sentence segmentation apparatus may also be used with a model training apparatus 500. Referring to FIG. 5, FIG. 5 is a schematic structural diagram of a model training apparatus disclosed in the present application. The model training apparatus can be applied to electronic devices with strong computing capability, such as servers and computers; alternatively, it can also be applied to terminal devices with weaker computing capability, such as in-vehicle terminals, which is not specifically limited. The model training apparatus 500 may include: an acquisition module 510, a selection module 520, a training module 530, a determination module 540, and an adjustment module 550.
The acquisition module 510 is configured to obtain sample data, wherein the multiple training sentence texts included in the sample data are generated according to the classification rules of the preset sentence types.
The selection module 520 is configured to select training sentence texts from the sample data.
The training module 530 is configured to input the training sentence texts into the classification model to be trained to obtain the training classification results of the training sentence texts, wherein a training classification result includes the predicted probability of each sample sentence text in the training sentence texts for each preset sentence type.
The determination module 540 is configured to determine the training sentence type of a training sentence text according to its training classification result.
The adjustment module 550 is configured to calculate the training loss according to the training sentence type of the training sentence text and the true sentence type corresponding to the training sentence text, and to adjust the parameters of the classification model to be trained according to the training loss, so as to obtain a trained classification model.
Referring to FIG. 6, FIG. 6 is a schematic structural diagram of an electronic device disclosed in the present application. As shown in FIG. 6, the electronic device 600 may include:
a memory 610 storing executable program code; and
a processor 620 coupled to the memory 610;
wherein the processor 620 invokes the executable program code stored in the memory 610 to perform any of the speech sentence segmentation methods disclosed in the present application.
The present application discloses a computer-readable storage medium storing a computer program, wherein, when the computer program is executed by a processor, the processor implements any of the speech sentence segmentation methods disclosed in the present application.
It should be understood that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic related to the embodiment is included in at least one embodiment of the present application. Thus, appearances of "in one embodiment" or "in an embodiment" in various places throughout the specification do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and that the actions and modules involved are not necessarily required by the present application.
In the various embodiments of the present application, it should be understood that the sequence numbers of the above processes do not imply a necessary order of execution; the order of execution of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the present application.
The units described above as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-accessible memory. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like, and specifically may be a processor in the computer device) to perform some or all of the steps of the methods in the embodiments of the present application.
Those of ordinary skill in the art can understand that all or part of the steps of the methods in the above embodiments may be completed by a program instructing related hardware, and the program may be stored in a computer-readable storage medium, including a read-only memory (ROM), a random access memory (RAM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), a one-time programmable read-only memory (OTPROM), an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disk storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.
The embodiments of the present application have been described above. The foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, the practical application or the improvement over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

  1. A speech sentence segmentation method, characterized in that the method comprises:
    constructing consecutive sentences for a first target speech text according to classification rules of preset sentence types, wherein multiple test sentences included in the consecutive sentences are text fragments extracted from the first target speech text and respectively corresponding to different preset sentence types;
    obtaining a predicted classification result of the consecutive sentences through a trained classification model, wherein the predicted classification result includes a predicted probability of each of the test sentences for the preset sentence types;
    determining a predicted sentence type respectively corresponding to each of the test sentences according to the predicted classification result; and
    segmenting the first target speech text according to the predicted sentence types respectively corresponding to the test sentences.
  2. The method according to claim 1, characterized in that determining the predicted sentence type respectively corresponding to each of the test sentences according to the predicted classification result comprises:
    determining the predicted sentence type of each test sentence according to the predicted probabilities respectively corresponding to the test sentences included in the predicted classification result and the probability thresholds respectively corresponding to the preset sentence types.
  3. The method according to claim 1, characterized in that the preset sentence types comprise at least a first type, a second type, a third type, and a fourth type of sentence; the first type includes text of an incomplete instruction, the second type includes text of a complete instruction, the third type includes the complete instruction text plus an increment of N characters beyond the text of the complete instruction, and the fourth type includes the complete instruction text plus an increment of M characters beyond the text of the complete instruction; N and M are positive integers, and M is greater than N;
    the multiple test sentences included in the consecutive sentences correspond at least to the first type, the second type, the third type, and the fourth type, respectively.
  4. The method according to claim 1, characterized in that segmenting the first target speech text according to the predicted sentence types corresponding to the test sentences comprises:
    if test sentences whose predicted sentence types are the second type, the third type, and the fourth type appear consecutively among the test sentences included in the consecutive sentences, determining the test sentence whose predicted sentence type is the second type as a first segmented text; and
    segmenting the first target speech text according to the first segmented text.
  5. The method according to claim 4, characterized in that segmenting the first target speech text according to the first segmented text comprises:
    if the first segmented text conforms to business logic rules, determining the first segmented text as a segmentation result of the first target speech text; and
    if the first segmented text does not conform to the business logic rules, ignoring the first segmented text and not segmenting the first target speech text.
  6. The method according to any one of claims 1-5, characterized in that, after segmenting the first target speech text according to the predicted sentence types corresponding to the test sentences, the method further comprises:
    obtaining a breakpoint mark corresponding to the segmentation of the first target speech text, sliding a sliding window to behind the breakpoint mark, and determining the text inside the sliding window as a second target speech text; and
    obtaining a predicted classification result of the second target speech text using the trained classification model, and segmenting the second target speech text using the predicted classification result of the second target speech text.
  7. The method according to any one of claims 1-5, characterized in that, before classification prediction is performed on the first target speech text through the trained classification model, the method further comprises:
    acquiring an initial speech text; and
    deleting, from the initial speech text, the initial sentence text that is consistent with a common speech template, and determining the initial speech text remaining after the deletion as the first target speech text.
  8. The method according to claim 1, characterized in that, before constructing the consecutive sentences for the first target speech text according to the classification rules of the preset sentence types, the method comprises:
    obtaining sample data, wherein multiple training sentence texts included in the sample data are generated according to the classification rules of the preset sentence types;
    selecting training sentence texts from the sample data;
    inputting the training sentence texts into a classification model to be trained to obtain training classification results of the training sentence texts, wherein a training classification result includes a predicted probability of each sample sentence text in the training sentence texts for the preset sentence types;
    determining a training sentence type of each training sentence text according to its training classification result; and
    calculating a training loss according to the training sentence type of the training sentence text and a true sentence type corresponding to the training sentence text, and adjusting parameters of the classification model to be trained according to the training loss, to obtain the trained classification model.
  9. A speech sentence segmentation apparatus, characterized in that the apparatus comprises:
    a construction module, configured to construct consecutive sentences for a first target speech text according to classification rules of preset sentence types, wherein multiple test sentences included in the consecutive sentences are text fragments extracted from the first target speech text and respectively corresponding to different preset sentence types;
    an acquisition module, configured to obtain a predicted classification result of the test sentences through a trained classification model, wherein the predicted classification result includes a predicted probability of the test sentences for the preset sentence types;
    a determination module, configured to determine a predicted sentence type corresponding to the test sentences according to the predicted classification result; and
    a sentence segmentation module, configured to segment the first target speech text according to the predicted sentence types corresponding to the test sentences.
  10. [Corrected 17.01.2023 under Rule 26]
    An electronic device, characterized in that it comprises a memory and a processor, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the processor is caused to implement the method according to any one of claims 1-7 or 8.
  11. [Corrected 17.01.2023 under Rule 26]
    A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the method according to any one of claims 1-7 or 8 is implemented.
PCT/CN2022/140275 2022-01-04 2022-12-20 Speech sentence segmentation method and apparatus, electronic device, and storage medium WO2023130951A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210001104.9A CN114420102B (en) 2022-01-04 2022-01-04 Method and device for speech sentence-breaking, electronic equipment and storage medium
CN202210001104.9 2022-01-04

Publications (1)

Publication Number Publication Date
WO2023130951A1 true WO2023130951A1 (en) 2023-07-13

Family

ID=81271294

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/140275 WO2023130951A1 (en) 2022-01-04 2022-12-20 Speech sentence segmentation method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN114420102B (en)
WO (1) WO2023130951A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114420102B (en) * 2022-01-04 2022-10-14 广州小鹏汽车科技有限公司 Method and device for speech sentence-breaking, electronic equipment and storage medium
CN115579009B (en) * 2022-12-06 2023-04-07 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9111547B2 (en) * 2012-08-22 2015-08-18 Kodak Alaris Inc. Audio signal semantic concept classification method
CN110264997A (en) * 2019-05-30 2019-09-20 北京百度网讯科技有限公司 The method, apparatus and storage medium of voice punctuate
CN110705254A (en) * 2019-09-27 2020-01-17 科大讯飞股份有限公司 Text sentence-breaking method and device, electronic equipment and storage medium
CN111161711A (en) * 2020-04-01 2020-05-15 支付宝(杭州)信息技术有限公司 Method and device for sentence segmentation of flow type speech recognition text
CN112711939A (en) * 2020-12-23 2021-04-27 深圳壹账通智能科技有限公司 Sentence-breaking method, device, equipment and storage medium based on natural language
CN114420102A (en) * 2022-01-04 2022-04-29 广州小鹏汽车科技有限公司 Method and device for speech sentence-breaking, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108628819B (en) * 2017-03-16 2022-09-20 北京搜狗科技发展有限公司 Processing method and device for processing
CN108549628B (en) * 2018-03-16 2021-08-27 云知声智能科技股份有限公司 Sentence-breaking device and method for stream type natural language information
CN109325237B (en) * 2018-10-22 2023-06-13 传神语联网网络科技股份有限公司 Complete sentence recognition method and system for machine translation
CN111160003B (en) * 2018-11-07 2023-12-08 北京猎户星空科技有限公司 Sentence breaking method and sentence breaking device
CN111950256A (en) * 2020-06-23 2020-11-17 北京百度网讯科技有限公司 Sentence break processing method and device, electronic equipment and computer storage medium
CN112380343A (en) * 2020-11-19 2021-02-19 平安科技(深圳)有限公司 Problem analysis method, problem analysis device, electronic device and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9111547B2 (en) * 2012-08-22 2015-08-18 Kodak Alaris Inc. Audio signal semantic concept classification method
CN110264997A (en) * 2019-05-30 2019-09-20 北京百度网讯科技有限公司 The method, apparatus and storage medium of voice punctuate
CN110705254A (en) * 2019-09-27 2020-01-17 科大讯飞股份有限公司 Text sentence-breaking method and device, electronic equipment and storage medium
CN111161711A (en) * 2020-04-01 2020-05-15 支付宝(杭州)信息技术有限公司 Method and device for sentence segmentation of flow type speech recognition text
CN112711939A (en) * 2020-12-23 2021-04-27 深圳壹账通智能科技有限公司 Sentence-breaking method, device, equipment and storage medium based on natural language
CN114420102A (en) * 2022-01-04 2022-04-29 广州小鹏汽车科技有限公司 Method and device for speech sentence-breaking, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114420102B (en) 2022-10-14
CN114420102A (en) 2022-04-29

Similar Documents

Publication Publication Date Title
US11651163B2 (en) Multi-turn dialogue response generation with persona modeling
WO2023130951A1 (en) Speech sentence segmentation method and apparatus, electronic device, and storage medium
US10061766B2 (en) Systems and methods for domain-specific machine-interpretation of input data
CN108304375B (en) Information identification method and equipment, storage medium and terminal thereof
WO2020228732A1 (en) Method for training dialog state tracker, and computer device
CN110245348B (en) Intention recognition method and system
CN109840287A (en) A kind of cross-module state information retrieval method neural network based and device
CN114694076A (en) Multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion
KR20190120353A (en) Speech recognition methods, devices, devices, and storage media
WO2021073298A1 (en) Speech information processing method and apparatus, and intelligent terminal and storage medium
US10943600B2 (en) Systems and methods for interrelating text transcript information with video and/or audio information
CA2899532A1 (en) Method and device for acoustic language model training
KR20230040951A (en) Speech recognition method, apparatus and device, and storage medium
WO2022042125A1 (en) Named entity recognition method
JP2020004382A (en) Method and device for voice interaction
US20220044671A1 (en) Spoken language understanding
US11935315B2 (en) Document lineage management system
CN111126084B (en) Data processing method, device, electronic equipment and storage medium
CN110263345B (en) Keyword extraction method, keyword extraction device and storage medium
CN112446219A (en) Chinese request text intention analysis method
US20230096070A1 (en) Natural-language processing across multiple languages
US11822893B2 (en) Machine learning models for detecting topic divergent digital videos
CN113553398B (en) Search word correction method, search word correction device, electronic equipment and computer storage medium
WO2021082570A1 (en) Artificial intelligence-based semantic identification method, device, and semantic identification apparatus
Prasetyo et al. Implementation voice command system for soccer robot ERSOW

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22918395

Country of ref document: EP

Kind code of ref document: A1