WO2023130951A1 - Speech sentence segmentation method and apparatus, electronic device, and storage medium - Google Patents

Speech sentence segmentation method and apparatus, electronic device, and storage medium

Info

Publication number
WO2023130951A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
text
type
predicted
test
Prior art date
Application number
PCT/CN2022/140275
Other languages
French (fr)
Chinese (zh)
Inventor
李嘉辉
肖畅
翁志伟
孙仿逊
Original Assignee
广州小鹏汽车科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州小鹏汽车科技有限公司 filed Critical 广州小鹏汽车科技有限公司
Publication of WO2023130951A1 publication Critical patent/WO2023130951A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/1815Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/87Detection of discrete points within a voice signal
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • G10L2015/0631Creating reference templates; Clustering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command

Definitions

  • the present application relates to the technical field of natural language processing, and in particular to a speech sentence segmentation method, device, electronic equipment and storage medium.
  • Existing speech segmentation methods are mainly applied to intelligent dialogue systems in multiple application scenarios through natural language processing technology in the field of artificial intelligence.
  • For example, in an intelligent dialogue system for an in-vehicle scenario, a speech segmentation method can identify the multiple independent instructions contained in a user's voice command and segment the command so that each independent instruction can be executed properly.
  • However, in practice it has been found that the common natural language processing models must predict the category of every single word in the user's voice command when segmenting speech, which makes the model's task heavy.
  • the first aspect of the present application provides a method for speech segmentation, including:
  • constructing continuous sentences for the first target speech text according to the classification rules of the preset sentence types, where the multiple test sentences included in the continuous sentences are text fragments extracted from the first target speech text that respectively correspond to different preset sentence types;
  • obtaining the predicted classification result of the continuous sentences through a trained classification model, where the predicted classification result includes the predicted probability of each test sentence for each preset sentence type;
  • determining the predicted sentence type corresponding to each test sentence according to the predicted classification result;
  • the first target speech text is segmented according to the predicted sentence type corresponding to each test sentence.
  • the second aspect of the present application provides a speech sentence segmentation device, including:
  • a construction module, configured to construct continuous sentences for the first target speech text according to the classification rules of the preset sentence types, where the multiple test sentences included in the continuous sentences are text fragments extracted from the first target speech text that respectively correspond to different preset sentence types;
  • an acquisition module, configured to obtain the predicted classification result of the test sentences through the trained classification model, where the predicted classification result includes the predicted probability of each test sentence for each preset sentence type;
  • a determining module, configured to determine the predicted sentence type corresponding to each test sentence according to the predicted classification result;
  • the sentence segmentation module is configured to segment the first target speech text according to the predicted sentence type corresponding to each of the test sentences.
  • the third aspect of the present application provides an electronic device, including a memory and a processor, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the processor can implement any one of the methods disclosed in the present application.
  • the fourth aspect of the present application provides a computer-readable storage medium, which stores a computer program, wherein the computer program enables the computer to execute any one of the speech segmentation methods disclosed in the present application.
  • According to the speech sentence segmentation method provided by this application, multiple text fragments respectively corresponding to different preset sentence types are extracted from the first target speech text according to the classification rules of the preset sentence types, and these text fragments are used as multiple test sentences; the test sentences are input into the trained classification model, which outputs the predicted probability of each test sentence for each preset sentence type; and after the predicted sentence type corresponding to each test sentence is determined from these probabilities, the first target speech text is segmented.
  • The speech segmentation method in this application can divide the user's voice command into multiple speech sentences and confirm whether the sentence type of the current speech sentence is a complete independent sentence, without using the trained model to detect the category of each word in the user's voice command, which simplifies the model structure and improves the accuracy of speech segmentation.
  • Fig. 1 is a schematic flow chart of a speech sentence segmentation method disclosed in the present application.
  • Fig. 2 is a schematic flow chart of another speech sentence segmentation method disclosed in the present application.
  • Fig. 3 is a schematic flow chart of another speech sentence segmentation method disclosed in the present application.
  • Fig. 4 is a schematic structural diagram of a speech sentence segmentation device disclosed in the present application.
  • Fig. 5 is a schematic structural diagram of a model training device disclosed in the present application.
  • FIG. 6 is a schematic structural diagram of an electronic device disclosed in the present application.
  • Although terms such as first, second and third may be used in this application to describe various information, such information should not be limited to these terms; the terms are only used to distinguish information of the same type from one another.
  • For example, without departing from the scope of this application, first information may also be called second information, and similarly, second information may also be called first information.
  • Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of these features.
  • “plurality” means two or more, unless otherwise specifically defined.
  • The application discloses a speech sentence segmentation method, device, electronic equipment and storage medium, which improve the accuracy of speech sentence segmentation; these are described in detail below.
  • FIG. 1 is a schematic flow chart of a speech segmentation method disclosed in the present application.
  • This method can be applied to various smart terminals, such as smart phones, smart homes, wearable devices, vehicle-mounted terminals and other electronic devices, which are not specifically limited.
  • the usage scenarios of this method can be industries and scenarios such as smart home, vehicle voice, intelligent customer service, medical scenarios, and industrial scenarios.
  • the method includes the following steps:
  • the multiple test sentences included in the continuous sentence are text segments that are extracted from the first target speech text and respectively correspond to different preset sentence types.
  • the first target speech text is generated by performing text preprocessing after converting the audio signal in the user's speech command into corresponding text data through the process of recognition and understanding.
  • The first target speech text has the characteristics of the corresponding application scenario. For example, in a vehicle speech scenario the first target speech text may include: open the car window, play music, turn on the lights, etc.; in a smart home scenario, such as a smart speaker, it may include: search for popular songs, turn up the volume, switch to the next song, etc.
  • The preset sentence types include at least a first type, a second type, a third type and a fourth type of sentence. The first type of sentence is the text of an incomplete instruction; the second type is the text of a complete instruction; the third type is the complete instruction text plus an incremental text of N characters beyond the complete instruction; the fourth type is the complete instruction text plus an incremental text of M characters beyond the complete instruction. N and M are positive integers, and M is greater than N.
  • The text of a complete instruction may be a semantically complete text, such as "turn on the air conditioner"; the text of an incomplete instruction may be a semantically incomplete text, such as the truncated fragment "turn on the air". The incremental text is the extra text a fragment contains beyond the semantically complete text: N may be 1, for example the complete instruction followed by one extra character, and M may be 2, for example the complete instruction followed by two extra characters.
  • For example, when the first target speech text is "open the car window and turn off the air conditioner", continuous sentences may be constructed for it according to the classification rules of the preset sentence types.
  • The continuous sentences may include test sentences of the first, second, third and fourth preset types: the first-type sentence is the truncated "open the car", the second-type sentence is "open the car window", the third-type sentence is "open the car window" plus one extra character, and the fourth-type sentence is "open the car window" plus two extra characters.
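  • As a concrete illustration of this construction step, the sketch below cuts the four candidate fragments for one candidate instruction out of the target text; the helper name and the N=1, M=2 values are illustrative assumptions rather than the patent's implementation:

```python
def build_test_sentences(text: str, end: int, n: int = 1, m: int = 2):
    """Construct the four test sentences for a candidate instruction text[:end].

    Returns fragments corresponding to the four preset sentence types:
    incomplete, complete, complete + N extra characters, and complete + M
    extra characters (N=1, M=2 as in the example above).
    """
    return {
        "type1_incomplete": text[:end - 1],
        "type2_complete":   text[:end],
        "type3_plus_n":     text[:end + n],
        "type4_plus_m":     text[:end + m],
    }

# For the first target speech text "打开车窗关闭空调" ("open the car window and
# turn off the air conditioner") and a candidate instruction of length 4
# ("打开车窗"), this yields "打开车" / "打开车窗" / "打开车窗关" / "打开车窗关闭",
# matching the example in the text.
print(build_test_sentences("打开车窗关闭空调", end=4))
```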
  • the predicted classification results include the predicted probabilities corresponding to each test sentence in the preset sentence types.
  • The classification model may be a pre-trained language model from the field of natural language processing, such as a BERT (Bidirectional Encoder Representations from Transformers) model, an ELMo (Embeddings from Language Models) model, or an ALBERT (A Lite BERT) model.
  • ALBERT is an improved version of the BERT model with far fewer parameters than the traditional BERT structure, which improves training speed and model performance.
  • the improvement of ALBERT mainly lies in the factorization of embedding layer parameters, cross-layer parameter sharing mechanism, and inter-sentence continuity loss function.
  • The trained classification model is obtained by training with a large number of training sentence texts. When the continuous sentences are input into the trained classification model, it outputs the predicted probability of each test sentence in the continuous sentences for each preset sentence type.
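  • As an illustration of this inference step, the sketch below scores the test sentences with a sequence-classification model via the Hugging Face transformers library; the checkpoint path is a hypothetical placeholder and the four-label head is an assumption mirroring the four preset sentence types, not the patent's actual model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint: any ALBERT/BERT-style Chinese model fine-tuned
# with a 4-way classification head (one label per preset sentence type).
CKPT = "path/to/finetuned-albert-sentence-type"  # hypothetical

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=4)
model.eval()

test_sentences = ["打开车窗", "打开车窗关", "打开车窗关闭"]

with torch.no_grad():
    inputs = tokenizer(test_sentences, padding=True, return_tensors="pt")
    logits = model(**inputs).logits        # shape: (3 test sentences, 4 types)
    probs = torch.softmax(logits, dim=-1)  # predicted probabilities

# Each row holds the probability of one test sentence belonging to each
# of the four preset sentence types.
print(probs)
```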
  • According to the predicted probability of each test sentence included in the predicted classification result and the probability threshold corresponding to each preset sentence type, the predicted sentence type corresponding to each test sentence is determined.
  • For example, given three consecutive test sentences constructed from "open the car window and turn off the air conditioner", if the model predicts accurately, the first test sentence has the largest predicted probability of belonging to the second type of sentence, the second test sentence of belonging to the third type, and the third test sentence of belonging to the fourth type.
  • Whether a predicted classification result is credible can be determined by setting a threshold, and different thresholds can be set for different preset sentence types.
  • For example, when the preset sentence type in question is the second type, a test sentence is confirmed to be of the second type if its predicted probability of belonging to the second type exceeds the threshold. If the predicted probabilities of three test sentences belonging to the second type are 0.8, 0.5 and 0.4 respectively and the threshold is set to 0.6, only the test sentence with predicted probability 0.8 is confirmed as the second type.
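  • The threshold check can be sketched as follows; the uniform 0.6 thresholds follow the example above, and the function name and None-for-untrusted encoding are illustrative assumptions:

```python
def predict_types(probs, thresholds=(0.6, 0.6, 0.6, 0.6)):
    """Map each test sentence's probability vector to a predicted type.

    A type index (0..3) is returned only if the highest probability also
    exceeds that type's threshold; otherwise the result is None (untrusted).
    """
    results = []
    for row in probs:
        best = max(range(len(row)), key=lambda i: row[i])
        results.append(best if row[best] > thresholds[best] else None)
    return results

# Three test sentences whose probabilities of belonging to the second type
# are 0.8, 0.5 and 0.4, as in the example: only the first is accepted.
probs = [
    [0.05, 0.80, 0.10, 0.05],
    [0.30, 0.50, 0.15, 0.05],
    [0.35, 0.40, 0.15, 0.10],
]
print(predict_types(probs))  # [1, None, None]
```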
  • FIG. 2 is a schematic flow chart of another voice sentence segmentation method disclosed in the present application, which can be applied to any of the aforementioned electronic devices. As shown in Figure 2, the method includes the following steps:
  • The initial speech text is unpreprocessed text data collected from the user's utterance command; that is, it may be the text data obtained by directly converting the collected audio signal into corresponding text data.
  • The common speech templates may be several common voice commands that match the current application scenario, for example "close the car window" or "turn on the air conditioner" in a vehicle scenario, or prefix words in the user's voice command such as "please" or "I want".
  • Based on the common speech templates, the prefix and some action words of the user's utterance command can be matched, achieving a preliminary semantic understanding and preprocessing the initial speech text.
  • For example, if the common speech templates are "close the car window" and "turn on the air conditioner", and the prefix or action words of the initial speech text include "close the car window", which is exactly the same as a common template, then "close the car window" is executed directly, the four characters "close the car window" are deleted from the initial speech text, and the remaining initial speech text is determined as the first target speech text.
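  • A minimal sketch of this template-based preprocessing is shown below; the template list, prefix list and function name are illustrative assumptions rather than the patent's actual templates:

```python
COMMON_TEMPLATES = ["关闭车窗", "打开空调"]  # "close the car window", "turn on the air conditioner"
COMMON_PREFIXES = ["请", "我要"]            # "please", "I want"

def preprocess(initial_text: str):
    """Strip a leading common template or prefix from the initial speech text.

    Returns (matched_template, first_target_text). A matched template would
    be executed directly; the remainder becomes the first target speech text.
    """
    for template in COMMON_TEMPLATES:
        if initial_text.startswith(template):
            return template, initial_text[len(template):]
    for prefix in COMMON_PREFIXES:
        if initial_text.startswith(prefix):
            return None, initial_text[len(prefix):]
    return None, initial_text

# "关闭车窗打开氛围灯": execute "关闭车窗" directly and keep "打开氛围灯"
# as the first target speech text.
print(preprocess("关闭车窗打开氛围灯"))
```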
  • The speech sentence segmentation method of the present application can decompose the utterance segmentation task into two sub-tasks.
  • The first task is to predict whether the first several characters of the initial speech text can form a semantically complete sentence; the second task is to judge, based on the predicted classification results of the continuous sentences in the first target speech text, whether the continuous sentences contain a semantically complete sentence, and if so, to segment the text there.
  • For the implementation of steps 203-205, reference may be made to steps 110-130 in the foregoing embodiment; details are not repeated here.
  • If test sentences whose predicted sentence types are the second type, the third type and the fourth type appear consecutively among the test sentences included in the continuous sentences, the test sentence whose predicted type is the second type is determined as the first segmented text.
  • If test sentences whose predicted types are the first type, the second type and the third type appear consecutively in the continuous sentences, these consecutive test sentences are ignored and the check is repeated for a consecutive run of test sentences whose predicted types are the second, third and fourth types.
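  • The consecutive-type check described in these two steps can be sketched as a scan for a run of second-, third- and fourth-type predictions; the encoding of one predicted type per test sentence, in text order, is an assumption made for illustration:

```python
def find_first_segment(test_sentences, predicted_types):
    """Return the test sentence confirmed as the first segmented text.

    `predicted_types` holds one predicted type per test sentence (1-based
    type numbers as in the text). A segment is confirmed only when types
    2, 3, 4 appear consecutively; other runs (e.g. 1, 2, 3) are skipped.
    """
    for i in range(len(predicted_types) - 2):
        if predicted_types[i:i + 3] == [2, 3, 4]:
            return test_sentences[i]  # the type-2 (complete instruction) sentence
    return None

sentences = ["打开车", "打开车窗", "打开车窗关", "打开车窗关闭"]
types = [1, 2, 3, 4]
print(find_first_segment(sentences, types))  # 打开车窗
```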
  • If the first segmented text conforms to the business logic rule, the first segmented text is determined as a segmentation result of the first target speech text.
  • If the first segmented text does not conform to the business logic rule, the first segmented text is ignored and the first target speech text is not segmented.
  • The business logic rule may be that no break is performed when the last character of the first segmented text is located immediately before the word "and" in the first target speech text. For example, for "open the car window and air conditioner", if the first segmented text is "open the car window", that is, the first target speech text would be broken between "open the car window" and "and air conditioner", the first segmented text is ignored and the first target speech text is not segmented.
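  • A sketch of this business-logic check (no break immediately before "和"/"and" in the target text) is shown below; the rule encoding and function name are assumptions for illustration only:

```python
def passes_business_rule(target_text: str, first_segment: str) -> bool:
    """Reject a break point that falls immediately before "和" ("and")."""
    cut = len(first_segment)
    return not (first_segment and target_text.startswith(first_segment)
                and target_text[cut:cut + 1] == "和")

# "打开车窗和空调" ("open the car window and air conditioner"):
# breaking after "打开车窗" would split "...和空调", so the rule rejects it.
print(passes_business_rule("打开车窗和空调", "打开车窗"))    # False
print(passes_business_rule("打开车窗关闭空调", "打开车窗"))  # True
```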
  • the sliding window is the logical window in the sliding window algorithm, which generally acts on strings or arrays.
  • the algorithm can be run within a window of a certain size.
  • the window sliding process it is necessary to delete elements that slide out of the window and add elements that slide into the window.
  • The second target speech text is the remaining text content after the first segmented text is deleted from the first target speech text. After the first target speech text has been segmented and the sliding window has slid past the corresponding breakpoint mark, the first segmented text leaves the sliding window and the sliding window now starts at the beginning of the second target speech text.
  • The first target speech text may contain multiple user instructions, while the first segmented text corresponds to only one of them; for the text content of the first target speech text other than the first segmented text, steps 120-140 continue to be performed until all user instructions contained in the first target speech text have been executed.
  • Step 210 is similar to the aforementioned steps 203-209; details are not repeated here.
  • The sliding window makes full use of the classification results of the classification model, eliminates the interference of the first segmented text with the subsequent segmentation process, and focuses the segmentation on the second target speech text. For example, if the first target speech text is "close the car window, turn on the air conditioner and turn on the ambient light" and the determined first segmented text is "close the car window", then the second target speech text is "turn on the air conditioner and turn on the ambient light".
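  • The sliding-window loop can be sketched as follows; `segment_once` is a stub standing in for the candidate construction, classification and rule checks described above, so the example stays self-contained:

```python
KNOWN_COMPLETE = ("关闭车窗", "打开空调", "打开氛围灯")  # illustrative only

def segment_once(target_text: str) -> str:
    """Stand-in for one round of candidate construction, classification and
    rule checking; here it simply matches a known complete-instruction prefix."""
    for instruction in KNOWN_COMPLETE:
        if target_text.startswith(instruction):
            return instruction
    return ""

def sliding_window_segmentation(first_target_text: str):
    """Repeatedly segment, then slide the window past each breakpoint mark."""
    segments, remaining = [], first_target_text
    while remaining:
        first_segment = segment_once(remaining)
        if not first_segment:
            break
        segments.append(first_segment)
        # Slide the window behind the breakpoint mark: the confirmed segment
        # leaves the window and the remainder becomes the next target text.
        remaining = remaining[len(first_segment):]
    return segments

# "关闭车窗打开空调打开氛围灯" -> ["关闭车窗", "打开空调", "打开氛围灯"]
print(sliding_window_segmentation("关闭车窗打开空调打开氛围灯"))
```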
  • the speech segmentation method of traditional sequence labeling uses a model to predict the category of each word in a short sentence as one of the beginning, middle, end, and irrelevant.
  • the disadvantage is that the single model task is too heavy and needs to predict the category of each word.
  • The speech sentence segmentation method of the present application does not need to use a model to determine the category of each word in the first target speech text; instead, it segments the first target speech text according to the predicted sentence type corresponding to each test sentence constructed from it.
  • the use of sliding windows can achieve more optimal breakpoint selection, obtain higher fault tolerance, and achieve higher precision, recall and sentence accuracy.
  • Table 1 below compares the precision rate, recall rate and sentence segmentation accuracy of the sequence-labeling speech segmentation method and the sliding-window speech segmentation method.
  • Table 1: Example test results of the sliding-window speech segmentation method and the sequence-labeling speech segmentation method
  • FIG. 3 is a schematic flow chart of another speech segmentation method disclosed in the present application.
  • the multiple training sentence segmentation texts included in the sample data are generated according to classification rules of preset sentence segmentation types.
  • the multiple training texts may be a certain number of commonly used instructions of users manually selected according to the characteristics of different application scenarios, which can be used to describe the needs of users in the application scenarios.
  • The multiple training sentence texts may also include training texts respectively corresponding to the first type of sentence, the second type of sentence, the third type of sentence and the fourth type of sentence.
  • each training sentence text included in the sample data may correspond to a real sentence type.
  • the real sentence type may be manually marked, or it may be an accurate sentence type identified based on other classification methods, which is not specifically limited.
  • The method of selecting training sentence texts from the sample data may be random selection, or may be selection of continuous sentences according to the classification rules of the preset sentence types; this is not specifically limited.
  • the training classification results include prediction probabilities corresponding to the training sentence texts in preset sentence types.
  • the predicted sentence segmentation types corresponding to each training sentence segmentation text are determined.
  • the calculated loss may be L1 loss, L2 loss, cross-entropy loss, etc., but is not limited thereto.
  • the method for adjusting model parameters may be gradient descent method, grid search method, random search method, Bayesian optimization method, etc., but is not limited thereto.
  • the aforementioned steps 310 to 350 may be a process of training the classification model.
  • the predicted probability output by the classification model obtained after training is relatively accurate, and can be applied to scenarios such as vehicle-mounted speech recognition.
  • the following steps 360-390 are performed to segment voice commands.
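  • A hedged sketch of such a training loop (steps 310-350) is shown below, using cross-entropy loss and gradient descent as examples of the options listed above; the toy model, sample data and hyperparameters are placeholders rather than the patent's configuration:

```python
import torch
from torch import nn

# Toy stand-in for the classification model to be trained: in practice this
# would be an ALBERT/BERT-style encoder with a 4-way classification head.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # gradient descent
criterion = nn.CrossEntropyLoss()                         # cross-entropy loss

# Placeholder sample data: feature vectors for training sentence texts and
# their real sentence types (0..3, one per preset sentence type).
features = torch.randn(16, 32)
real_types = torch.randint(0, 4, (16,))

for epoch in range(5):
    logits = model(features)              # training classification result
    loss = criterion(logits, real_types)  # training loss vs. real sentence types
    optimizer.zero_grad()
    loss.backward()                       # adjust parameters according to the loss
    optimizer.step()
    print(f"epoch {epoch}: loss={loss.item():.4f}")
```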
  • FIG. 4 is a schematic structural diagram of a speech sentence segmentation device disclosed in the present application.
  • the device can be applied to electronic devices such as vehicle terminals, and is not specifically limited.
  • The speech segmentation device 400 may include: a construction module 410, an acquisition module 420, a determination module 430, and a sentence segmentation module 440.
  • The construction module 410 is configured to construct continuous sentences for the first target speech text according to the classification rules of the preset sentence types, where the multiple test sentences included in the continuous sentences are text fragments extracted from the first target speech text that respectively correspond to different preset sentence types.
  • The acquisition module 420 is configured to obtain the predicted classification result of the continuous sentences through the trained classification model, where the predicted classification result includes the predicted probability of each test sentence for each preset sentence type.
  • Determining module 430 is used for determining the predicted sentence type corresponding to each test sentence according to the prediction classification result
  • the sentence segmentation module 440 is configured to segment the first target speech text according to the predicted sentence type corresponding to each test sentence.
  • The determination module 430 may be configured to determine the predicted sentence type corresponding to each test sentence according to the predicted probability of each test sentence included in the predicted classification result and the probability threshold corresponding to each preset sentence type.
  • The preset sentence types include at least the first type, the second type, the third type and the fourth type of sentence; the first type of sentence includes the text of an incomplete instruction; the second type includes the text of a complete instruction; the third type includes the complete instruction text plus an incremental text of N characters beyond the complete instruction; the fourth type includes the complete instruction text plus an incremental text of M characters beyond the complete instruction; N and M are positive integers, and M is greater than N.
  • The multiple test sentences included in the continuous sentences correspond at least to the first type, the second type, the third type and the fourth type of sentence respectively.
  • the sentence segmentation module 440 further includes a determination unit and a sentence segmentation unit.
  • The determining unit may be configured to determine the test sentence whose predicted type is the second type as the first segmented text if test sentences whose predicted sentence types are the second type, the third type and the fourth type appear consecutively among the test sentences included in the continuous sentences.
  • The sentence segmentation unit is configured to segment the first target speech text according to the first segmented text.
  • The sentence segmentation unit may also be configured to determine the first segmented text as the segmentation result of the first target speech text if the first segmented text conforms to the business logic rule, and to ignore the first segmented text and not segment the first target speech text if it does not conform to the business logic rule.
  • The speech sentence segmentation device further includes a sliding module, configured to obtain the breakpoint mark corresponding to the segmentation of the first target speech text, slide the sliding window behind the breakpoint mark, and determine the text in the sliding window as the second target speech text; the trained classification model is then used to obtain the predicted classification result of the second target speech text, and the second target speech text is segmented using this predicted classification result.
  • The speech sentence segmentation device further includes a preprocessing module, configured to obtain the initial speech text, delete from the initial speech text the initial segmented text that is consistent with a common speech template, and determine the initial speech text after this deletion as the first target speech text.
  • The trained classification model used by the speech sentence segmentation device may also be obtained by means of the model training device 500 described below.
  • FIG. 5 is a schematic structural diagram of a model training device disclosed in the present application.
  • the model training device can be applied to electronic devices with strong computing capabilities such as servers and computers; or, the model training device can also be applied to terminal devices with weak computing capabilities such as vehicle terminals, which are not specifically limited.
  • the model training device 500 may include: an acquisition module 510 , a selection module 520 , a training module 530 , a determination module 540 , and an adjustment module 550 .
  • An acquisition module 510, configured to acquire sample data, where the multiple training sentence texts included in the sample data are generated according to the classification rules of the preset sentence types;
  • the selection module 520 is used to select the training sentence sentence text from the sample data
  • The training module 530 is configured to input the training sentence texts into the classification model to be trained to obtain the training classification results of the training sentence texts, where the training classification result includes the predicted probability of each training sentence text for each preset sentence type;
  • Determining module 540 is used to determine the training sentence type of the training sentence text according to the training classification result of the training sentence text;
  • The adjustment module 550 is configured to calculate the training loss according to the training sentence types of the training sentence texts and the real sentence types corresponding to the training sentence texts, and to adjust the parameters of the classification model to be trained according to the training loss to obtain the trained classification model.
  • FIG. 6 is a schematic structural diagram of an electronic device disclosed in the present application. As shown in FIG. 6, the electronic device 600 may include:
  • a memory 610 storing executable program code
  • processor 620 coupled to the memory 610;
  • the processor 620 invokes the executable program code stored in the memory 610 to execute any one of the voice sentence segmentation methods disclosed in this application.
  • the present application discloses a computer-readable storage medium, which stores a computer program, wherein, when the computer program is executed by the processor, the processor is made to implement any one of the speech segmentation methods disclosed in the present application.
  • The sequence numbers of the above processes do not necessarily indicate the order of execution; the execution order of the processes should be determined by their functions and internal logic, and does not constitute any limitation on the implementation process of the present application.
  • The units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • If the above integrated units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-accessible memory.
  • The technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server or a network device, and specifically a processor in the computer device) to execute some or all of the steps of the above methods in the various embodiments of the present application.
  • ROM read-only Memory
  • RAM random access memory
  • PROM programmable read-only memory
  • EPROM Erasable Programmable Read Only Memory
  • OTPROM One-time Programmable Read-Only Memory
  • EEPROM Electrically Erasable Programmable Read-Only Memory
  • CD-ROM Compact Disc Read-Only Memory

Abstract

A speech sentence segmentation method and apparatus, an electronic device, and a storage medium. The method comprises: constructing continuously segmented sentences for a first target speech text according to a classification rule of a preset segmented sentence type (110); obtaining a predicted classification result of the continuously segmented sentences by means of a trained classification model (120); determining, according to the predicted classification result, a predicted segmented sentence type respectively corresponding to each test segmented sentence (130); and performing sentence segmentation on the first target speech text according to the predicted segmented sentence type respectively corresponding to each test segmented sentence (140).

Description

语音断句方法、装置、电子设备及存储介质Speech sentence segmentation method, device, electronic equipment and storage medium
本申请要求于2022年01月04日提交国家知识产权局、申请号为202210001104.9、申请名称为“语音断句方法、装置、电子设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application with the application number 202210001104.9 and the application name "method, device, electronic equipment and storage medium for phonetic sentence segmentation" submitted to the State Intellectual Property Office on January 04, 2022, the entire content of which is incorporated by reference incorporated in this application.
技术领域technical field
本申请涉及自然语言处理技术领域,具体涉及一种语音断句方法、装置、电子设备及存储介质。The present application relates to the technical field of natural language processing, and in particular to a speech sentence segmentation method, device, electronic equipment and storage medium.
背景技术Background technique
现有的语音断句方法主要通过人工智能领域的自然语言处理技术,应用于多个应用场景下的智能对话系统。比如在车载应用场景下的智能对话系统中,利用语音断句方法可以识别用户语音命令中包含的多个独立指令,从而对语音命令进行断句,以便合理地执行各个独立指令。但是,在实践中发现,现有的自然语言处理常用模型在语音断句时需要预测用户语音命令中每一个字的类别,存在模型任务繁重的问题。Existing speech segmentation methods are mainly applied to intelligent dialogue systems in multiple application scenarios through natural language processing technology in the field of artificial intelligence. For example, in an intelligent dialogue system in a vehicle application scenario, the voice segmentation method can be used to identify multiple independent instructions contained in the user's voice command, so as to segment the voice command so as to reasonably execute each independent instruction. However, in practice, it is found that the existing common models for natural language processing need to predict the category of each word in the user's voice command when the speech is segmented, and there is a problem of heavy model tasks.
发明内容Contents of the invention
本申请第一方面提供一种语音断句方法,包括:The first aspect of the present application provides a method for speech segmentation, including:
根据预设断句类型的分类规则，为第一目标语音文本构造连续断句，所述连续断句包括的多个测试断句是从第一目标语音文本中截取出来，且与不同的预设断句类型分别对应的文本片段；According to the classification rules of the preset sentence types, continuous sentences are constructed for the first target speech text, where the multiple test sentences included in the continuous sentences are text fragments extracted from the first target speech text that respectively correspond to different preset sentence types;
通过训练完成的分类模型获取所述连续断句的预测分类结果;所述预测分类结果包括各个所述测试断句在所述预设断句类型中对应的预测概率;Obtaining the predicted classification results of the continuous sentences through the trained classification model; the predicted classification results include the corresponding predicted probabilities of each of the test sentences in the preset sentence types;
根据所述预测分类结果确定与各个所述测试断句分别对应的预测断句类型;Determining the predicted sentence types corresponding to each of the test sentences according to the predicted classification results;
根据各个所述测试断句分别对应的预测断句类型对所述第一目标语音文本进行断句。The first target speech text is segmented according to the predicted sentence type corresponding to each test sentence.
本申请第二方面提供一种语音断句装置,包括:The second aspect of the present application provides a speech sentence segmentation device, including:
构造模块，用于根据预设断句类型的分类规则，对第一目标语音文本构造连续断句，所述连续断句包括的多个测试断句是从第一目标语音文本中截取出来的与不同的预设断句类型分别对应的文本片段；A construction module, configured to construct continuous sentences for the first target speech text according to the classification rules of the preset sentence types, where the multiple test sentences included in the continuous sentences are text fragments extracted from the first target speech text that respectively correspond to different preset sentence types;
获取模块,用于通过训练完成的分类模型获取所述测试断句的预测分类结果;所述预测分类结果包括所述测试断句在所述预设断句类型中对应的预测概率;An acquisition module, configured to acquire the predicted classification result of the test sentence through the trained classification model; the predicted classification result includes the predicted probability corresponding to the test sentence in the preset sentence type;
确定模块,用于根据所述预测分类结果确定与所述测试断句对应的预测断句类型;A determining module, configured to determine a predicted sentence type corresponding to the test sentence according to the predicted classification result;
断句模块,用于根据各个所述测试断句对应的预测断句类型对所述第一目标语音文本进行断句。The sentence segmentation module is configured to segment the first target speech text according to the predicted sentence type corresponding to each of the test sentences.
本申请第三方面提供一种电子设备,包括存储器及处理器,所述存储器中存储有计算机程序,所述计算机程序被所述处理器执行时,使得所述处理器实现本申请公开的任意一种语音断句方法。The third aspect of the present application provides an electronic device, including a memory and a processor, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the processor can implement any one of the methods disclosed in the present application. A method of phonetic sentence segmentation.
本申请第四方面提供一种算机可读存储介质,其存储计算机程序,其中,所述计算机程序使得计算机执行本申请公开的任意一种语音断句方法。The fourth aspect of the present application provides a computer-readable storage medium, which stores a computer program, wherein the computer program enables the computer to execute any one of the speech segmentation methods disclosed in the present application.
依据本申请提供的语音断句方法，根据预设断句类型的分类规则，从第一目标语音文本中截取出与不同的预设断句类型分别对应的多个文本片段，将多个文本片段作为多个测试断句；将多个测试断句输入到训练完成的分类模型中，从训练完成的分类模型中输出测试断句在各个预设断句类型中对应的预测概率；根据测试断句在各个预设断句类型中对应的预测概率，在确定与测试断句对应的预测断句类型之后，对第一目标语音文本进行断句。本申请中的语音断句方法能够将用户语音命令分类成多个语音断句，并确认当前语音断句的句子类型是否是个完整的独立断句，无需将训练模型用于检测用户语音命令中每个字的类别，简化了模型结构，提高了语音断句的准确率。According to the speech sentence segmentation method provided by this application, multiple text fragments respectively corresponding to different preset sentence types are extracted from the first target speech text according to the classification rules of the preset sentence types, and these text fragments are used as multiple test sentences; the test sentences are input into the trained classification model, which outputs the predicted probability of each test sentence for each preset sentence type; and after the predicted sentence type corresponding to each test sentence is determined from these probabilities, the first target speech text is segmented. The speech segmentation method in this application can divide the user's voice command into multiple speech sentences and confirm whether the sentence type of the current speech sentence is a complete independent sentence, without using the trained model to detect the category of each word in the user's voice command, which simplifies the model structure and improves the accuracy of speech segmentation.
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,并不能限制本申请。It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
附图说明Description of drawings
通过结合附图对本申请示例性实施方式进行更详细的描述,本申请的上述以及其它目的、特征和优势将变得更加明显,其中,在本申请示例性 实施方式中,相同的参考标号通常代表相同部件。The above and other objects, features and advantages of the present application will become more apparent by describing the exemplary embodiments of the present application in more detail with reference to the accompanying drawings, wherein, in the exemplary embodiments of the present application, the same reference numerals generally represent same parts.
图1是本申请公开的一种语音断句方法的流程示意图;Fig. 1 is the schematic flow chart of a kind of speech segmentation method disclosed in the present application;
图2是本申请公开的另一种语音断句方法的流程示意图;Fig. 2 is the schematic flow chart of another kind of voice sentence segmentation method disclosed in the present application;
图3是本申请公开的另一种语音断句方法的流程示意图;Fig. 3 is the schematic flow chart of another kind of phonetic punctuation method disclosed in the present application;
图4是本申请公开的一种语音断句装置的结构示意图;Fig. 4 is a schematic structural diagram of a speech sentence segmentation device disclosed in the present application;
图5是本申请公开的一种模型训练装置的结构示意图;Fig. 5 is a schematic structural diagram of a model training device disclosed in the present application;
图6是本申请公开的一种电子设备的结构示意图。FIG. 6 is a schematic structural diagram of an electronic device disclosed in the present application.
具体实施方式Detailed ways
下面将参照附图更详细地描述本申请的实施方式。虽然附图中显示了本申请的实施方式,然而应该理解,可以以各种形式实现本申请而不应被这里阐述的实施方式所限制。相反,提供这些实施方式是为了使本申请更加透彻和完整,并且能够将本申请的范围完整地传达给本领域的技术人员。Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. Although embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this application will be thorough and complete, and will fully convey the scope of this application to those skilled in the art.
在本申请使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本申请。在本申请和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。The terminology used in this application is for the purpose of describing particular embodiments only, and is not intended to limit the application. As used in this application and the appended claims, the singular forms "a", "the", and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
应当理解,尽管在本申请可能采用术语“第一”、“第二”、“第三”等来描述各种信息,但这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如,在不脱离本申请范围的情况下,第一信息也可以被称为第二信息,类似地,第二信息也可以被称为第一信息。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。在本申请的描述中,“多个”的含义是两个或两个以上,除非另有明确具体的限定。It should be understood that although the terms "first", "second", "third" and so on may be used in this application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information, and similarly, second information may also be called first information. Thus, a feature defined as "first" and "second" may explicitly or implicitly include one or more of these features. In the description of the present application, "plurality" means two or more, unless otherwise specifically defined.
本申请公开了一种语音断句方法、装置、电子设备及存储介质,提高了语音断句的准确率,以下分别进行详细说明。The application discloses a phonetic sentence segmentation method, device, electronic equipment and storage medium, which improves the accuracy of phonetic sentence segmentation, and will be described in detail below.
以下结合附图详细描述本申请的技术方案。The technical solution of the present application will be described in detail below in conjunction with the accompanying drawings.
请参阅图1,图1是本申请公开的一种语音断句方法的流程示意图。该方法可应用于各种智能终端,如智能手机、智能家居、可穿戴设备、车载终端 等电子设备,具体不做限定。该方法的使用场景可以是智能家居、车载语音、智能客服、医疗场景、工业场景等行业和场景。Please refer to FIG. 1 . FIG. 1 is a schematic flow chart of a speech segmentation method disclosed in the present application. This method can be applied to various smart terminals, such as smart phones, smart homes, wearable devices, vehicle-mounted terminals and other electronic devices, which are not specifically limited. The usage scenarios of this method can be industries and scenarios such as smart home, vehicle voice, intelligent customer service, medical scenarios, and industrial scenarios.
例如,在车载语音场景中,当用户需要通过车载智能对话系统发出语音命令来操控系统时,当语音命令中包含多个独立指令时,比如“打开空调关闭车窗氛围灯打开播放音乐”,需要对语音命令进行断句,从而合理地执行用户语音命令中包含的各个独立指令。For example, in the car voice scene, when the user needs to issue a voice command to control the system through the car intelligent dialogue system, when the voice command contains multiple independent instructions, such as "turn on the air conditioner, close the window, and turn on the ambient light to play music", it is necessary to Segment the voice command so that each independent instruction contained in the user voice command can be reasonably executed.
如图1所示,该方法包括以下步骤:As shown in Figure 1, the method includes the following steps:
110、根据预设断句类型的分类规则,为第一目标语音文本构造连续断句。110. Construct continuous sentences for the first target speech text according to the classification rules of preset sentence types.
连续断句包括的多个测试断句是从第一目标语音文本中截取出来,且与不同的预设断句类型分别对应的文本片段。The multiple test sentences included in the continuous sentence are text segments that are extracted from the first target speech text and respectively correspond to different preset sentence types.
第一目标语音文本是通过识别和理解过程把用户话语命令中的音频信号转变为对应的文本数据后,进行文本预处理生成的。The first target speech text is generated by performing text preprocessing after converting the audio signal in the user's speech command into corresponding text data through the process of recognition and understanding.
其中，第一目标语音文本针对不同的应用场景，具有相应应用场景的特点，例如，针对车载语音场景，得到的第一目标语音文本可以包括：打开车窗打开音乐打开灯光等；针对智能家居场景，比如在智能音响的使用场景下，得到的第一目标语音文本可以包括：搜索热门歌曲调高音量切换下一首等。The first target speech text has the characteristics of the corresponding application scenario. For example, for a vehicle speech scenario the obtained first target speech text may include: open the car window, play music, turn on the lights, etc.; for a smart home scenario, such as a smart speaker, the obtained first target speech text may include: search for popular songs, turn up the volume, switch to the next song, etc.
预设断句类型至少包括第一类断句,第二类断句,第三类断句,第四类断句;第一类断句包括非完整指令的文本,第二类断句包括完整指令的文本,第三类断句包括完整指令文本以及除了完整指令的文本以外的N个字的增量文本,第四类断句包括完整指令文本以及除了完整指令的文本以外的M个字的增量文本;N和M为正整数,M大于N。The preset sentence types include at least the first type of sentence, the second type of sentence, the third type of sentence, and the fourth type of sentence; the first type of sentence includes the text of incomplete instructions, the second type of sentence includes the text of complete instructions, and the third type The sentence includes the complete instruction text and the incremental text of N words except the text of the complete instruction, and the fourth type of sentence includes the complete instruction text and the incremental text of M words except the text of the complete instruction; N and M are positive Integer, M greater than N.
其中,对于第一类断句,完整指令的文本可以是语义上完整的文本,比如“打开空调”;对于第二类断句,非完整指令的文本可以是语义上不完整的文本,比如“打开空”;对于第三类断句,增量文本可以是一个句子中除了语义上完整的文本外还包括的多余的文本,N可以取1,比如“打开空调关”;对于第四类断句,M可以取2,比如“打开空调关闭”。Among them, for the first type of sentence, the text of the complete instruction can be a semantically complete text, such as "turn on the air conditioner"; for the second type of sentence, the text of the incomplete instruction can be a semantically incomplete text, such as "open the air conditioner". "; for the third type of sentence sentence, the incremental text can be redundant text included in a sentence in addition to the semantically complete text, and N can be 1, such as "turn on the air conditioner"; for the fourth type of sentence sentence, M can be Take 2, for example, "turn on the air conditioner and turn it off".
预设断句类型的分类规则包括将第一目标语音文本至少分成第一类断句,第二类断句,第三类断句,第四类断句。The classification rules of the preset sentence types include at least classifying the first target speech text into the first type of sentence, the second type of sentence, the third type of sentence, and the fourth type of sentence.
例如，当第一目标语音文本为"打开车窗关闭空调"时，可根据预设断句类型的分类规则对第一目标语音文本构造连续断句。连续断句可以包括预设断句类型为第一类断句、第二类断句、第三类断句、第四类断句的多个测试断句；第一类断句是"打开车"，第二类断句是"打开车窗"，第三类断句是"打开车窗关"，第四类断句是"打开车窗关闭"。For example, when the first target speech text is "打开车窗关闭空调" ("open the car window and turn off the air conditioner"), continuous sentences may be constructed for it according to the classification rules of the preset sentence types. The continuous sentences may include test sentences of the first, second, third and fourth preset types: the first-type sentence is "打开车" (the truncated "open the car"), the second-type sentence is "打开车窗" ("open the car window"), the third-type sentence is "打开车窗关" (the complete instruction plus one extra character), and the fourth-type sentence is "打开车窗关闭" (the complete instruction plus two extra characters).
120、通过训练完成的分类模型获取连续断句的预测分类结果。120. Obtain the prediction and classification results of consecutive sentence segments through the trained classification model.
预测分类结果包括各个测试断句在预设断句类型中对应的预测概率。The predicted classification results include the predicted probabilities corresponding to each test sentence in the preset sentence types.
分类模型,可以是自然语言处理领域中的预训练语言模型,即BERT(Bidirectional Encoder Representations from Transformers)模型,ELMo(Embedding from Language Models)模型,或者ALBERT(A LITE BERT)模型中的任一项。其中,ALBERT是BERT模型的改进版本,参数量远远少于传统的BERT模型结构,提升了训练速度和模型性能。ALBERT的改进主要在于嵌入层参数因式分解、跨层参数共享机制、句间连续性损失函数。The classification model can be a pre-trained language model in the field of natural language processing, namely BERT (Bidirectional Encoder Representations from Transformers) model, ELMo (Embedding from Language Models) model, or any of the ALBERT (ALITE BERT) models. Among them, ALBERT is an improved version of the BERT model, with far fewer parameters than the traditional BERT model structure, which improves the training speed and model performance. The improvement of ALBERT mainly lies in the factorization of embedding layer parameters, cross-layer parameter sharing mechanism, and inter-sentence continuity loss function.
训练完成的分类模型是利用大量的训练断句文本进行训练后得到的,向训练完成的分类模型输入连续断句,可以输出连续断句中各个测试断句在预设断句类型中对应的预测概率。The trained classification model is obtained by using a large number of training sentence texts for training. Continuous sentence input is input to the trained classification model, and the corresponding prediction probability of each test sentence in the continuous sentence in the preset sentence type can be output.
130、根据预测分类结果确定与各个测试断句分别对应的预测断句类型。130. Determine the predicted sentence type corresponding to each test sentence according to the predicted classification result.
根据预测分类结果包括的各个测试断句分别对应的预测概率，以及各个预设断句类型分别对应的概率阈值，确定各个测试断句中每个测试断句对应的预测断句类型。According to the predicted probability of each test sentence included in the predicted classification result and the probability threshold corresponding to each preset sentence type, the predicted sentence type corresponding to each test sentence is determined.
例如,给定一个连续的三个测试断句,“打开车窗”、“打开车窗关”、“打开车窗关闭”,获得每个测试断句在预设断句类型中对应的预测概率,如果模型预测效果准确的话,则第一个测试断句属于第二类断句的预测概率最大,第二个测试断句属于第三类断句的预测概率最大,第三个测试断句属于第四类断句的预测概率最大。通过设置阈值可以确定预测分类结果是否可信。对于不同的预设断句类型,可以设置不同的阈值。例如,针对测试断句对应的预设断句类型是第二类断句的情况,若测试断句属于第二类断句的预测概率大于阈值时,就确认该测试断句为第二类断句。若三个测试断句属于第二类断句的预测概率分别是0.8、0.5、0.4,当阈值设置为0.6时,则预测概率为0.8的测试断句属于第二类断句。For example, given a continuous three test sentences, "open the car window", "open the car window to close", "open the car window to close", obtain the corresponding predicted probability of each test sentence in the preset sentence type, if the model If the prediction effect is accurate, the prediction probability of the first test sentence belonging to the second type of sentence is the largest, the prediction probability of the second test sentence belonging to the third type of sentence is the largest, and the prediction probability of the third test sentence belonging to the fourth type of sentence is the largest . Whether the predicted classification result is credible can be determined by setting a threshold. For different preset sentence segmentation types, different thresholds can be set. For example, in the case that the preset sentence type corresponding to the test sentence is the second type of sentence, if the predicted probability that the test sentence belongs to the second type of sentence is greater than the threshold, the test sentence is confirmed to be the second type of sentence. If the predicted probabilities of the three test sentences belonging to the second type of sentence are 0.8, 0.5, and 0.4 respectively, when the threshold is set to 0.6, the test sentence with a predicted probability of 0.8 belongs to the second type of sentence.
140、根据各个测试断句分别对应的预测断句类型对第一目标语音文本进行断句。140. Segment the first target speech text according to the predicted sentence type corresponding to each test sentence.
根据预设断句类型的分类规则对第一目标语音文本构造多个测试断句,使得分类模型能够对测试断句执行分类任务,可以以多个字组成的句子作为预测的单元,而不是以单独一个字作为预测的单元,很大程度上简化了模型的结构,减轻了模型的预测任务。Construct multiple test sentences for the first target speech text according to the classification rules of the preset sentence types, so that the classification model can perform classification tasks on the test sentences, and a sentence composed of multiple characters can be used as a prediction unit instead of a single character As a unit of prediction, it greatly simplifies the structure of the model and reduces the prediction task of the model.
请参阅图2,图2是本申请公开的另一种语音断句方法的流程示意图,该方法可应用于前述的任意一种电子设备。如图2所示,该方法包括以下步骤:Please refer to FIG. 2 . FIG. 2 is a schematic flow chart of another voice sentence segmentation method disclosed in the present application, which can be applied to any of the aforementioned electronic devices. As shown in Figure 2, the method includes the following steps:
201、获取初始语音文本。201. Acquire an initial speech text.
初始语音文本是从用户话语命令中采集的未经预处理的文本数据。即，可以是直接从采集到的音频信号转变为对应的文本数据后得到的文本数据。The initial speech text is unpreprocessed text data collected from the user's utterance command; that is, it may be the text data obtained by directly converting the collected audio signal into corresponding text data.
202、将初始语音文本中与常用语音模板一致的初始断句文本从初始语音文本中删除,将删除了初始断句文本之后的初始语音文本确定为第一目标语音文本。202. Delete the initial sentence text in the initial speech text that is consistent with the common speech template from the initial speech text, and determine the initial speech text after the deletion of the initial sentence text as the first target speech text.
常用语音模板可以是与当前应用场景相匹配的多个常用语音命令,例如,在车载场景下,可以是“关闭车窗”、“打开空调”等,或者用户话语命令中的前缀词,比如“请”、“我要”等。基于常用语音模板,可以对用户的话语命令进行前缀以及部分动作词的适配,达到初步的语义理解,实现对初始语音文本的数据预处理。例如,常用语音模板是“关闭车窗”、“打开空调”,如果初始语音文本的前缀词或者部分动作词包括“关闭车窗”,正好与常用语音模板相同,则直接执行“关闭车窗”,并从初始语音文本中删除“关闭车窗”这四个字,将删除了这四个字之后的初始语音文本确定为第一目标语音文本。Commonly used voice templates can be multiple commonly used voice commands that match the current application scenario. For example, in a vehicle scenario, it can be "close the window", "turn on the air conditioner", etc., or a prefix word in the user's voice command, such as " Please", "I want" and so on. Based on the commonly used voice templates, the user's utterance commands can be prefixed and some action words can be adapted to achieve a preliminary semantic understanding and realize data preprocessing of the initial voice text. For example, the commonly used speech templates are "close the car window" and "turn on the air conditioner". If the prefix words or some action words of the initial speech text include "close the car window", which happens to be the same as the commonly used speech template, then directly execute "close the car window" , and delete the four words "close the car window" from the initial speech text, and determine the initial speech text after deleting these four words as the first target speech text.
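As a rough, non-limiting illustration of this preprocessing (the template list and the simple prefix-matching strategy below are assumptions; the embodiment only requires that text consistent with a common speech template be removed):

    # Sketch: strip leading fragments that match a common speech template or prefix word.
    COMMON_TEMPLATES = ["关闭车窗", "打开空调"]   # assumed in-vehicle command templates
    PREFIX_WORDS = ["请", "我要"]                 # assumed polite/filler prefixes

    def preprocess(initial_text):
        """Return (first_target_text, commands_matched_by_template)."""
        executed = []
        changed = True
        while changed:
            changed = False
            for pattern in PREFIX_WORDS + COMMON_TEMPLATES:
                if initial_text.startswith(pattern):
                    if pattern in COMMON_TEMPLATES:
                        executed.append(pattern)      # template command can be executed directly
                    initial_text = initial_text[len(pattern):]
                    changed = True
        return initial_text, executed

    print(preprocess("请打开车窗和空调"))
    # ('打开车窗和空调', [])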
Adapting the prefix and some action words of the user's spoken command based on templates reduces disturbance to the model. The speech sentence segmentation method of the present application splits the segmentation task into two subtasks: the first is to predict whether the first several characters of the initial speech text form a semantically complete sentence; the second is to judge, from the predicted classification results of the consecutive test sentences in the first target speech text, whether a semantically complete sentence exists among them, and if so, to perform segmentation.
203. Construct consecutive sentences for the first target speech text according to the classification rules of the preset sentence types.
204. Obtain the predicted classification result of the consecutive sentences through the trained classification model.
205. Determine the predicted sentence type of each test sentence according to the prediction probabilities of the test sentences included in the predicted classification result and the probability thresholds corresponding to the preset sentence types.
For the implementation of steps 203-205, reference may be made to steps 110-130 in the foregoing embodiment, and details are not repeated here.
206. If test sentences whose predicted sentence types are the second type, the third type, and the fourth type appear consecutively among the test sentences included in the consecutive sentences, determine the test sentence whose predicted sentence type is the second type as the first segmented text.
For example, if "open the car window", "open the car window, clo-", and "open the car window, close", whose predicted sentence types are the second, third, and fourth types respectively, appear consecutively among the test sentences, then "open the car window", whose predicted sentence type is the second type, can be determined as the first segmented text.
Exemplarily, if test sentences whose predicted sentence types are the first, second, and third types appear consecutively in the consecutive sentences, those consecutive test sentences are ignored, and it is detected again whether test sentences of the second, third, and fourth types appear consecutively.
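As an illustrative, non-limiting sketch of this type-2/type-3/type-4 pattern check (integer labels stand in for the preset sentence types):

    # Sketch: find a run of predicted types 2, 3, 4 over consecutive test sentences and
    # return the type-2 fragment as the first segmented text.
    def find_first_segment(fragments, types):
        for i in range(len(types) - 2):
            if types[i:i + 3] == [2, 3, 4]:
                return fragments[i]          # the type-2 fragment is a complete instruction
        return None                          # no credible segmentation point found

    frags = ["打开车窗", "打开车窗关", "打开车窗关闭"]
    print(find_first_segment(frags, [2, 3, 4]))   # 打开车窗
    print(find_first_segment(frags, [1, 2, 3]))   # None -> keep scanning, do not segment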
207. If the first segmented text conforms to the business logic rules, determine the first segmented text as the segmentation result of the first target speech text.
208. If the first segmented text does not conform to the business logic rules, ignore the first segmented text and do not segment the first target speech text.
After the first segmented text is obtained, whether the current segmentation is reasonable can be confirmed according to business logic rules. Exemplarily, a business logic rule may be that no break is performed when the last character of the first segmented text is located immediately before the character "和" (and) in the first target speech text. For example, for "打开车窗和空调" (open the car window and the air conditioner), if the first segmented text is "open the car window", meaning the first target speech text would be broken between "open the car window" and "and the air conditioner", the first segmented text is ignored and the first target speech text is not segmented.
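Only the single "和" (and) rule from the example is sketched below; an actual system would presumably maintain a broader, configurable set of business logic rules:

    # Sketch: reject a segmentation point that falls immediately before "和" ("and"),
    # since the user is most likely still listing coordinated objects.
    def passes_business_rules(target_text, first_segment):
        breakpoint_pos = len(first_segment)           # the segment is a prefix of the target text
        if target_text[breakpoint_pos:breakpoint_pos + 1] == "和":
            return False                              # e.g. "打开车窗 | 和空调" -> do not break
        return True

    print(passes_business_rules("打开车窗和空调", "打开车窗"))      # False -> ignore the segment
    print(passes_business_rules("打开车窗关闭空调", "打开车窗"))    # True  -> accept the segment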
209. Obtain the breakpoint mark corresponding to the segmentation of the first target speech text, slide the sliding window to behind the breakpoint mark, and determine the text inside the sliding window as the second target speech text.
A sliding window is the logical window in a sliding-window algorithm and generally operates on a string or an array. By continuously sliding the window, an algorithm can be run within a window of a given size. Before and after each slide, the middle elements do not change; only the leading and trailing elements change. In other words, the sum of the elements in the next window equals the sum of the elements in the previous window, minus the value of the element that leaves the window, plus the value of the element that newly enters it. During sliding, the elements that slide out of the window are removed and the elements that slide into it are added.
The second target speech text is the text remaining after the first segmented text is removed from the first target speech text. After the first target speech text is segmented and the sliding window is slid behind the corresponding breakpoint mark, the first segmented text leaves the sliding window, and the window now starts at the second target speech text.
Since the first target speech text may contain multiple user instructions while the first segmented text corresponds to only one of them, steps 120-140 continue to be performed on the remaining text of the first target speech text until all user instructions contained in the first target speech text have been executed.
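The breakpoint mark and sliding-window step, combined with the per-segment classification, may be sketched as follows (the stand-in classifier and helper names are assumptions for illustration, not the trained model of the embodiment):

    # Sketch: after a segment is accepted, advance the window past the breakpoint mark
    # and run the same classify-and-segment procedure on the remaining text.
    def segment_stream(target_text, classify_and_segment):
        """classify_and_segment(text) -> accepted segment (a prefix of text) or None."""
        results = []
        remaining = target_text
        while remaining:
            segment = classify_and_segment(remaining)
            if segment is None:
                break                                  # no further complete instruction found
            results.append(segment)
            remaining = remaining[len(segment):]       # slide the window past the breakpoint
        return results

    # Toy stand-in for the trained classifier: recognises three known instructions.
    def toy_classifier(text):
        for known in ("关闭车窗", "打开空调", "氛围灯打开"):
            if text.startswith(known):
                return known
        return None

    print(segment_stream("关闭车窗打开空调氛围灯打开", toy_classifier))
    # ['关闭车窗', '打开空调', '氛围灯打开']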
210. Obtain the predicted classification result of the second target speech text using the trained classification model, and segment the second target speech text using that result.
The implementation of step 210 is similar to the foregoing steps 203-209, and details are not repeated here.
The sliding window builds on the effect of the classification model and makes full use of its classification results; it removes the interference of the first segmented text with the subsequent segmentation process and focuses the segmentation on the second target speech text. For example, if the first target speech text is "关闭车窗打开空调氛围灯打开" (close the car window, turn on the air conditioner, ambient light on) and the determined first segmented text is "close the car window", then the second target speech text is "turn on the air conditioner, ambient light on".
A traditional sequence-labeling speech segmentation method uses a model to predict, for each character of a short sentence, whether its category is beginning, middle, end, or irrelevant. Its drawback is that the single model's task is too heavy, since the category of every character must be predicted.
The speech sentence segmentation method of the present application does not need a model to determine the category of each character in the first target speech text; instead, it segments the first target speech text according to the predicted sentence type of each test sentence. Meanwhile, the sliding window enables better breakpoint selection, provides higher fault tolerance, and achieves higher precision, recall, and sentence accuracy.
Table 1 below shows a comparison of precision, recall, and sentence accuracy between the sequence-labeling speech segmentation method and the sliding-window speech segmentation method.
                                              Precision    Recall    Sentence accuracy
    Sliding-window speech segmentation        97.83%       93.65%    87.8%
    Sequence-labeling speech segmentation 2   91.5%        89.1%     84%
    Sequence-labeling speech segmentation 1   89.5%        90%       78%
Table 1: Example test results of the sliding-window and sequence-labeling speech segmentation methods
Referring to FIG. 3, FIG. 3 is a schematic flowchart of another speech sentence segmentation method disclosed in the present application.
310. Obtain sample data.
The multiple training sentence texts included in the sample data are generated according to the classification rules of the preset sentence types.
The multiple training sentence texts may be a certain number of manually selected common user instructions, chosen according to the characteristics of different application scenarios, that describe user needs in those scenarios.
Optionally, if the preset sentence types include at least the first, second, third, and fourth types of sentence, the multiple training sentence texts may also include training sentence texts corresponding to each of the first, second, third, and fourth types.
In addition, each training sentence text included in the sample data may correspond to a true sentence type. The true sentence type may be manually labeled, or it may be an accurate sentence type identified by another classification method, which is not specifically limited.
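One possible, non-limiting way to derive such labeled samples from a list of complete instructions is sketched below; the instruction list and the increments N=1 and M=2 are assumptions, and in practice the samples may instead be labeled manually as noted above:

    # Sketch: derive training sentence texts and their true types from a complete instruction.
    # Type 1: incomplete prefix; type 2: complete instruction; type 3: +N chars; type 4: +M chars.
    def make_samples(instruction, followup, n=1, m=2):
        samples = [(instruction[:i], 1) for i in range(1, len(instruction))]  # incomplete prefixes
        samples.append((instruction, 2))                                      # complete instruction
        samples.append((instruction + followup[:n], 3))                       # N extra characters
        samples.append((instruction + followup[:m], 4))                       # M extra characters
        return samples

    for text, label in make_samples("打开车窗", "关闭空调"):
        print(label, text)
    # 1 打 / 1 打开 / 1 打开车 / 2 打开车窗 / 3 打开车窗关 / 4 打开车窗关闭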
320. Select training sentence texts from the sample data.
The training sentence texts may be selected from the sample data randomly, or as consecutive sentences selected according to the classification rules of the preset sentence types, which is not specifically limited.
330. Input the training sentence texts into the classification model to be trained to obtain the training classification results of the training sentence texts.
A training classification result includes the predicted probability of the training sentence text for each preset sentence type.
340. Determine the training sentence type of each training sentence text according to its training classification result.
The predicted sentence type of each training sentence text is determined according to the predicted probabilities of the training sentence texts included in the training classification results and the probability thresholds corresponding to the preset sentence types.
350. Calculate the training loss according to the training sentence types of the training sentence texts and the true sentence types corresponding to the training sentence texts, and adjust the parameters of the classification model to be trained according to the training loss, to obtain a trained classification model.
The calculated loss may be, but is not limited to, an L1 loss, an L2 loss, or a cross-entropy loss.
The model parameters may be adjusted by, but not limited to, gradient descent, grid search, random search, or Bayesian optimization.
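A non-limiting training-loop sketch is given below using a cross-entropy loss and gradient descent; the tiny bag-of-characters model, the toy samples, and the use of PyTorch are assumptions for illustration, since the embodiment does not fix a model architecture or framework:

    # Sketch: train a toy 4-way sentence-type classifier with cross-entropy loss and SGD.
    import torch
    import torch.nn as nn

    samples = [("打开车窗", 2), ("打开车窗关", 3), ("打开车窗关闭", 4), ("打开车", 1)]
    vocab = {ch: i for i, ch in enumerate(sorted({c for text, _ in samples for c in text}))}

    def encode(text):
        vec = torch.zeros(len(vocab))
        for ch in text:
            vec[vocab[ch]] += 1.0               # simple bag-of-characters features (assumed)
        return vec

    model = nn.Linear(len(vocab), 4)            # logits for the four preset sentence types
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(200):
        x = torch.stack([encode(text) for text, _ in samples])
        y = torch.tensor([label - 1 for _, label in samples])   # classes 0..3
        loss = criterion(model(x), y)            # training loss from predicted vs. true types
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    probs = torch.softmax(model(encode("打开车窗")), dim=-1)
    print(probs)                                  # predicted probabilities over the four types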
The foregoing steps 310-350 may constitute the process of training the classification model. The predicted probabilities output by the trained classification model are relatively accurate, and the model can be applied to scenarios such as in-vehicle speech recognition by performing the following steps 360-390 to segment voice commands.
360. Construct consecutive sentences for the first target speech text according to the classification rules of the preset sentence types.
370. Obtain the predicted classification result of the consecutive sentences through the trained classification model.
380. Determine the predicted sentence type of each test sentence according to the predicted classification result.
390. Segment the first target speech text according to the predicted sentence type of each test sentence.
Referring to FIG. 4, FIG. 4 is a schematic structural diagram of a speech sentence segmentation apparatus disclosed in the present application. The apparatus can be applied to electronic devices such as in-vehicle terminals, which is not specifically limited. As shown in FIG. 4, the speech sentence segmentation apparatus 400 may include: a construction module 410, an acquisition module 420, a determination module 430, and a sentence segmentation module 440.
The construction module 410 is configured to construct consecutive sentences for the first target speech text according to the classification rules of the preset sentence types, wherein the multiple test sentences included in the consecutive sentences are text fragments extracted from the first target speech text that correspond to different preset sentence types.
The acquisition module 420 is configured to obtain the predicted classification result of the consecutive sentences through the trained classification model, wherein the predicted classification result includes the predicted probability of each test sentence for each preset sentence type.
The determination module 430 is configured to determine the predicted sentence type corresponding to each test sentence according to the predicted classification result.
The sentence segmentation module 440 is configured to segment the first target speech text according to the predicted sentence type corresponding to each test sentence.
In one embodiment, the determination module 430 may be configured to determine the predicted sentence type of each test sentence according to the predicted probabilities of the test sentences included in the predicted classification result and the probability thresholds corresponding to the preset sentence types.
In one embodiment, the preset sentence types include at least a first type, a second type, a third type, and a fourth type of sentence; the first type includes text of an incomplete instruction, the second type includes text of a complete instruction, the third type includes the complete instruction text plus an increment of N characters beyond it, and the fourth type includes the complete instruction text plus an increment of M characters beyond it; N and M are positive integers, and M is greater than N.
The multiple test sentences included in the consecutive sentences correspond at least to the first type, the second type, the third type, and the fourth type, respectively.
In one embodiment, the sentence segmentation module 440 further includes a determination unit and a segmentation unit.
The determination unit may be configured to, if test sentences whose predicted sentence types are the second type, the third type, and the fourth type appear consecutively among the test sentences included in the consecutive sentences, determine the test sentence whose predicted sentence type is the second type as the first segmented text.
The segmentation unit may be configured to segment the first target speech text according to the first segmented text.
In one embodiment, the segmentation unit may further be configured to: if the first segmented text conforms to the business logic rules, determine the first segmented text as the segmentation result of the first target speech text; and if the first segmented text does not conform to the business logic rules, ignore the first segmented text and not segment the first target speech text.
In one embodiment, the speech sentence segmentation apparatus further includes a sliding module configured to obtain the breakpoint mark corresponding to the segmentation of the first target speech text, slide the sliding window to behind the breakpoint mark, and determine the text inside the sliding window as the second target speech text; and to obtain the predicted classification result of the second target speech text using the trained classification model and segment the second target speech text using that result.
In one embodiment, the speech sentence segmentation apparatus further includes a preprocessing module configured to acquire an initial speech text, delete from the initial speech text the initial sentence text that is consistent with a common speech template, and determine the initial speech text remaining after the deletion as the first target speech text.
In one embodiment, the speech sentence segmentation apparatus may also be used with a model training apparatus 500. Referring to FIG. 5, FIG. 5 is a schematic structural diagram of a model training apparatus disclosed in the present application. The model training apparatus can be applied to electronic devices with strong computing capability, such as servers and computers; alternatively, it can also be applied to terminal devices with weaker computing capability, such as in-vehicle terminals, which is not specifically limited. The model training apparatus 500 may include: an acquisition module 510, a selection module 520, a training module 530, a determination module 540, and an adjustment module 550.
The acquisition module 510 is configured to obtain sample data, wherein the multiple training sentence texts included in the sample data are generated according to the classification rules of the preset sentence types.
The selection module 520 is configured to select training sentence texts from the sample data.
The training module 530 is configured to input the training sentence texts into the classification model to be trained to obtain the training classification results of the training sentence texts, wherein a training classification result includes the predicted probability of each sample sentence text in the training sentence texts for each preset sentence type.
The determination module 540 is configured to determine the training sentence type of a training sentence text according to its training classification result.
The adjustment module 550 is configured to calculate the training loss according to the training sentence type of the training sentence text and the true sentence type corresponding to the training sentence text, and to adjust the parameters of the classification model to be trained according to the training loss, so as to obtain a trained classification model.
Referring to FIG. 6, FIG. 6 is a schematic structural diagram of an electronic device disclosed in the present application. As shown in FIG. 6, the electronic device 600 may include:
a memory 610 storing executable program code; and
a processor 620 coupled to the memory 610;
wherein the processor 620 invokes the executable program code stored in the memory 610 to perform any of the speech sentence segmentation methods disclosed in the present application.
The present application discloses a computer-readable storage medium storing a computer program, wherein, when the computer program is executed by a processor, the processor implements any of the speech sentence segmentation methods disclosed in the present application.
It should be understood that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic related to the embodiment is included in at least one embodiment of the present application. Thus, appearances of "in one embodiment" or "in an embodiment" in various places throughout the specification do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and that the actions and modules involved are not necessarily required by the present application.
In the various embodiments of the present application, it should be understood that the sequence numbers of the above processes do not imply a necessary order of execution; the order of execution of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the present application.
The units described above as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-accessible memory. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like, and specifically may be a processor in the computer device) to perform some or all of the steps of the methods in the embodiments of the present application.
Those of ordinary skill in the art can understand that all or part of the steps of the methods in the above embodiments may be completed by a program instructing related hardware, and the program may be stored in a computer-readable storage medium, including a read-only memory (ROM), a random access memory (RAM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), a one-time programmable read-only memory (OTPROM), an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disk storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.
The embodiments of the present application have been described above. The foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, the practical application or the improvement over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

  1. A speech sentence segmentation method, characterized in that the method comprises:
    constructing consecutive sentences for a first target speech text according to classification rules of preset sentence types, wherein multiple test sentences included in the consecutive sentences are text fragments extracted from the first target speech text and respectively corresponding to different preset sentence types;
    obtaining a predicted classification result of the consecutive sentences through a trained classification model, wherein the predicted classification result includes a predicted probability of each of the test sentences for the preset sentence types;
    determining a predicted sentence type respectively corresponding to each of the test sentences according to the predicted classification result; and
    segmenting the first target speech text according to the predicted sentence types respectively corresponding to the test sentences.
  2. The method according to claim 1, characterized in that determining the predicted sentence type respectively corresponding to each of the test sentences according to the predicted classification result comprises:
    determining the predicted sentence type of each test sentence according to the predicted probabilities respectively corresponding to the test sentences included in the predicted classification result and the probability thresholds respectively corresponding to the preset sentence types.
  3. The method according to claim 1, characterized in that the preset sentence types comprise at least a first type, a second type, a third type, and a fourth type of sentence; the first type includes text of an incomplete instruction, the second type includes text of a complete instruction, the third type includes the complete instruction text plus an increment of N characters beyond the text of the complete instruction, and the fourth type includes the complete instruction text plus an increment of M characters beyond the text of the complete instruction; N and M are positive integers, and M is greater than N;
    the multiple test sentences included in the consecutive sentences correspond at least to the first type, the second type, the third type, and the fourth type, respectively.
  4. The method according to claim 1, characterized in that segmenting the first target speech text according to the predicted sentence types corresponding to the test sentences comprises:
    if test sentences whose predicted sentence types are the second type, the third type, and the fourth type appear consecutively among the test sentences included in the consecutive sentences, determining the test sentence whose predicted sentence type is the second type as a first segmented text; and
    segmenting the first target speech text according to the first segmented text.
  5. The method according to claim 4, characterized in that segmenting the first target speech text according to the first segmented text comprises:
    if the first segmented text conforms to business logic rules, determining the first segmented text as a segmentation result of the first target speech text; and
    if the first segmented text does not conform to the business logic rules, ignoring the first segmented text and not segmenting the first target speech text.
  6. The method according to any one of claims 1-5, characterized in that, after segmenting the first target speech text according to the predicted sentence types corresponding to the test sentences, the method further comprises:
    obtaining a breakpoint mark corresponding to the segmentation of the first target speech text, sliding a sliding window to behind the breakpoint mark, and determining the text inside the sliding window as a second target speech text; and
    obtaining a predicted classification result of the second target speech text using the trained classification model, and segmenting the second target speech text using the predicted classification result of the second target speech text.
  7. The method according to any one of claims 1-5, characterized in that, before classification prediction is performed on the first target speech text through the trained classification model, the method further comprises:
    acquiring an initial speech text; and
    deleting, from the initial speech text, the initial sentence text that is consistent with a common speech template, and determining the initial speech text remaining after the deletion as the first target speech text.
  8. The method according to claim 1, characterized in that, before constructing the consecutive sentences for the first target speech text according to the classification rules of the preset sentence types, the method comprises:
    obtaining sample data, wherein multiple training sentence texts included in the sample data are generated according to the classification rules of the preset sentence types;
    selecting training sentence texts from the sample data;
    inputting the training sentence texts into a classification model to be trained to obtain training classification results of the training sentence texts, wherein a training classification result includes a predicted probability of each sample sentence text in the training sentence texts for the preset sentence types;
    determining a training sentence type of each training sentence text according to its training classification result; and
    calculating a training loss according to the training sentence type of the training sentence text and a true sentence type corresponding to the training sentence text, and adjusting parameters of the classification model to be trained according to the training loss, to obtain the trained classification model.
  9. A speech sentence segmentation apparatus, characterized in that the apparatus comprises:
    a construction module, configured to construct consecutive sentences for a first target speech text according to classification rules of preset sentence types, wherein multiple test sentences included in the consecutive sentences are text fragments extracted from the first target speech text and respectively corresponding to different preset sentence types;
    an acquisition module, configured to obtain a predicted classification result of the test sentences through a trained classification model, wherein the predicted classification result includes a predicted probability of the test sentences for the preset sentence types;
    a determination module, configured to determine a predicted sentence type corresponding to the test sentences according to the predicted classification result; and
    a sentence segmentation module, configured to segment the first target speech text according to the predicted sentence types corresponding to the test sentences.
  10. [Corrected 17.01.2023 under Rule 26]
    An electronic device, characterized in that it comprises a memory and a processor, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the processor is caused to implement the method according to any one of claims 1-7 or 8.
  11. [Corrected 17.01.2023 under Rule 26]
    A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the method according to any one of claims 1-7 or 8 is implemented.
PCT/CN2022/140275 2022-01-04 2022-12-20 Speech sentence segmentation method and apparatus, electronic device, and storage medium WO2023130951A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210001104.9A CN114420102B (en) 2022-01-04 2022-01-04 Method and device for speech sentence-breaking, electronic equipment and storage medium
CN202210001104.9 2022-01-04

Publications (1)

Publication Number Publication Date
WO2023130951A1 true WO2023130951A1 (en) 2023-07-13

Family

ID=81271294

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/140275 WO2023130951A1 (en) 2022-01-04 2022-12-20 Speech sentence segmentation method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN114420102B (en)
WO (1) WO2023130951A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114420102B (en) * 2022-01-04 2022-10-14 广州小鹏汽车科技有限公司 Method and device for speech sentence-breaking, electronic equipment and storage medium
CN115579009B (en) * 2022-12-06 2023-04-07 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9111547B2 (en) * 2012-08-22 2015-08-18 Kodak Alaris Inc. Audio signal semantic concept classification method
CN110264997A (en) * 2019-05-30 2019-09-20 北京百度网讯科技有限公司 The method, apparatus and storage medium of voice punctuate
CN110705254A (en) * 2019-09-27 2020-01-17 科大讯飞股份有限公司 Text sentence-breaking method and device, electronic equipment and storage medium
CN111161711A (en) * 2020-04-01 2020-05-15 支付宝(杭州)信息技术有限公司 Method and device for sentence segmentation of flow type speech recognition text
CN112711939A (en) * 2020-12-23 2021-04-27 深圳壹账通智能科技有限公司 Sentence-breaking method, device, equipment and storage medium based on natural language
CN114420102A (en) * 2022-01-04 2022-04-29 广州小鹏汽车科技有限公司 Method and device for speech sentence-breaking, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108628819B (en) * 2017-03-16 2022-09-20 北京搜狗科技发展有限公司 Processing method and device for processing
CN108549628B (en) * 2018-03-16 2021-08-27 云知声智能科技股份有限公司 Sentence-breaking device and method for stream type natural language information
CN109325237B (en) * 2018-10-22 2023-06-13 传神语联网网络科技股份有限公司 Complete sentence recognition method and system for machine translation
CN111160003B (en) * 2018-11-07 2023-12-08 北京猎户星空科技有限公司 Sentence breaking method and sentence breaking device
CN111950256A (en) * 2020-06-23 2020-11-17 北京百度网讯科技有限公司 Sentence break processing method and device, electronic equipment and computer storage medium
CN112380343A (en) * 2020-11-19 2021-02-19 平安科技(深圳)有限公司 Problem analysis method, problem analysis device, electronic device and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9111547B2 (en) * 2012-08-22 2015-08-18 Kodak Alaris Inc. Audio signal semantic concept classification method
CN110264997A (en) * 2019-05-30 2019-09-20 北京百度网讯科技有限公司 The method, apparatus and storage medium of voice punctuate
CN110705254A (en) * 2019-09-27 2020-01-17 科大讯飞股份有限公司 Text sentence-breaking method and device, electronic equipment and storage medium
CN111161711A (en) * 2020-04-01 2020-05-15 支付宝(杭州)信息技术有限公司 Method and device for sentence segmentation of flow type speech recognition text
CN112711939A (en) * 2020-12-23 2021-04-27 深圳壹账通智能科技有限公司 Sentence-breaking method, device, equipment and storage medium based on natural language
CN114420102A (en) * 2022-01-04 2022-04-29 广州小鹏汽车科技有限公司 Method and device for speech sentence-breaking, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114420102B (en) 2022-10-14
CN114420102A (en) 2022-04-29

Similar Documents

Publication Publication Date Title
US11651163B2 (en) Multi-turn dialogue response generation with persona modeling
WO2023130951A1 (en) Speech sentence segmentation method and apparatus, electronic device, and storage medium
US10061766B2 (en) Systems and methods for domain-specific machine-interpretation of input data
CN108304375B (en) Information identification method and equipment, storage medium and terminal thereof
WO2020228732A1 (en) Method for training dialog state tracker, and computer device
CN110245348B (en) Intention recognition method and system
CN109840287A (en) A kind of cross-module state information retrieval method neural network based and device
CN114694076A (en) Multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion
KR20190120353A (en) Speech recognition methods, devices, devices, and storage media
WO2021073298A1 (en) Speech information processing method and apparatus, and intelligent terminal and storage medium
US10943600B2 (en) Systems and methods for interrelating text transcript information with video and/or audio information
CA2899532A1 (en) Method and device for acoustic language model training
KR20230040951A (en) Speech recognition method, apparatus and device, and storage medium
WO2022042125A1 (en) Named entity recognition method
JP2020004382A (en) Method and device for voice interaction
US20220044671A1 (en) Spoken language understanding
US11935315B2 (en) Document lineage management system
CN111126084B (en) Data processing method, device, electronic equipment and storage medium
CN110263345B (en) Keyword extraction method, keyword extraction device and storage medium
CN112446219A (en) Chinese request text intention analysis method
US20230096070A1 (en) Natural-language processing across multiple languages
US11822893B2 (en) Machine learning models for detecting topic divergent digital videos
CN113553398B (en) Search word correction method, search word correction device, electronic equipment and computer storage medium
WO2021082570A1 (en) Artificial intelligence-based semantic identification method, device, and semantic identification apparatus
Prasetyo et al. Implementation voice command system for soccer robot ERSOW

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22918395

Country of ref document: EP

Kind code of ref document: A1