WO2023137920A1 - Semantic truncation detection method, apparatus, device, and computer-readable storage medium - Google Patents

Semantic truncation detection method, apparatus, device, and computer-readable storage medium

Info

Publication number
WO2023137920A1
Authority
WO
WIPO (PCT)
Prior art keywords
truncation
text data
detected
semantic
data
Prior art date
Application number
PCT/CN2022/090745
Other languages
English (en)
French (fr)
Inventor
赵仕豪
马骏
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2023137920A1 publication Critical patent/WO2023137920A1/zh

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing
    • G06F 40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the fields of artificial intelligence and natural language processing, and in particular to a semantic truncation detection method, apparatus, device, and computer-readable storage medium.
  • the following is a technical problem of the prior art recognized by the inventor: in current intelligent customer service systems, the general interaction flow is that the user finishes stating a request, and the intelligent customer service robot then recognizes the received voice information and provides the corresponding service. However, because of the diversity of users' speaking habits and the complexity of real application scenarios, a user often pauses after saying a few words and, just as the user is about to continue, the robot has already begun to reply; the user's intention is then misrecognized, the number of interactions between the user and the robot increases, and the user experience suffers.
  • if the waiting time of the customer service robot is extended instead, the time the user must wait for the robot's feedback increases accordingly, which likewise gives the user a poor experience and reduces user satisfaction.
  • the embodiment of the present application provides a semantic truncation detection method, including:
  • obtaining first corpus data, and obtaining multiple semantic truncation types according to the first corpus data, wherein the first corpus data is historical text data in which semantic truncation occurs;
  • according to the semantic truncation type, detecting the text data to be detected through preset rules and/or a pre-trained BERT classification model, and obtaining a detection result of whether semantic truncation occurs in the text data to be detected;
  • the BERT classification model is obtained through the following training steps:
  • the business corpus data includes a plurality of business text data
  • the embodiment of the present application also provides a semantic truncation detection apparatus, including:
  • the first acquisition module, used to acquire the text data to be detected;
  • the second acquisition module, used to acquire the first corpus data and obtain multiple semantic truncation types according to the first corpus data, wherein the first corpus data is historical text data in which semantic truncation occurs;
  • a judging module configured to judge the semantic truncation type to which the text data to be detected belongs;
  • a detection module configured to detect the text data to be detected through preset rules and/or a pre-trained BERT classification model according to the semantic truncation type, and obtain a detection result of whether semantic truncation occurs in the text data to be detected;
  • the third acquisition module, used to acquire business corpus data, wherein the business corpus data includes multiple pieces of business text data;
  • a positive example construction module, used to select a random position in each piece of business text data for segmentation and construct a positive example sentence pair, wherein the positive example sentence pair is a pair of preceding and following sentences with a truncation relationship;
  • a negative example construction module, used to select any two pieces of the business text data and construct a negative example sentence pair, wherein the negative example sentence pair is a pair of preceding and following sentences without a truncation relationship;
  • a training module configured to construct a training set from the positive example sentence pairs and the negative example sentence pairs, and input the training set into an initial BERT model for training to obtain the BERT classification model.
  • an embodiment of the present application further provides a computer device, including: a memory, a processor, and a computer program stored in the memory and operable on the processor.
  • a semantic truncation detection method is implemented, wherein the semantic truncation detection method includes: obtaining text data to be detected; obtaining first corpus data, and obtaining multiple semantic truncation types according to the first corpus data, wherein the first corpus data is historical text data in which semantic truncation occurs; judging the semantic truncation type to which the text data to be detected belongs; according to the semantic truncation type, detecting the text data to be detected through preset rules and/or a pre-trained BERT classification model to obtain a detection result of whether semantic truncation occurs in the text data to be detected; wherein the BERT classification model is obtained through the following training steps: obtaining business corpus data, wherein the business corpus data includes multiple pieces of business text data; selecting a random position in each piece of business text data for segmentation and constructing positive example sentence pairs, wherein a positive example sentence pair is a pair of preceding and following sentences with a truncation relationship; selecting any two pieces of business text data and constructing negative example sentence pairs, wherein a negative example sentence pair is a pair of preceding and following sentences without a truncation relationship; and constructing a training set from the positive and negative example sentence pairs and inputting the training set into an initial BERT model for training to obtain the BERT classification model.
  • an embodiment of the present application further provides a computer-readable storage medium, which stores computer-executable instructions for executing a semantic truncation detection method, wherein the semantic truncation detection method includes: acquiring text data to be detected; acquiring first corpus data, and obtaining multiple semantic truncation types according to the first corpus data, wherein the first corpus data is historical text data in which semantic truncation occurs; judging the semantic truncation type to which the text data to be detected belongs; according to the semantic truncation type, detecting the text data to be detected through preset rules and/or a pre-trained BERT classification model, and obtaining a detection result of whether semantic truncation occurs in the text data to be detected; wherein the BERT classification model is obtained through the following training steps: obtaining business corpus data, wherein the business corpus data includes multiple pieces of business text data; selecting a random position in each piece of business text data for segmentation and constructing positive example sentence pairs, wherein a positive example sentence pair is a pair of preceding and following sentences with a truncation relationship; selecting any two pieces of business text data and constructing negative example sentence pairs, wherein a negative example sentence pair is a pair of preceding and following sentences without a truncation relationship; and constructing a training set from the positive and negative example sentence pairs and inputting the training set into an initial BERT model for training to obtain the BERT classification model.
  • the semantic truncation detection method, apparatus, device, and computer-readable storage medium proposed in the embodiments of the present application obtain the text data to be detected, determine the semantic truncation type to which it belongs, and detect it through preset rules and/or a pre-trained BERT classification model; selecting different detection methods for different semantic truncation types provides interactive services to users in a more targeted manner, which helps improve responsiveness during the interaction.
  • in addition, a pretraining task designed around the characteristics of text truncation is used to train the initial BERT model: pairs of preceding and following sentences with a truncation relationship are constructed from the business text data as positive example sentence pairs, and pairs of preceding and following sentences without a truncation relationship are constructed as negative example sentence pairs.
  • training the model on the training set built from the positive and negative example sentence pairs lets the model better learn truncation features, which improves its recognition performance, so that the customer service robot can identify the user's intention more accurately in complex real interaction situations, reduce the extra interactions between the user and the robot caused by recognition failures, and effectively improve service quality and user satisfaction.
  • Fig. 1 is a flowchart of a semantic truncation detection method provided by an embodiment of the present application;
  • Fig. 2 is a flowchart of a BERT classification model training method provided by an embodiment of the present application;
  • Fig. 3 is a flowchart of a semantic truncation detection method provided by another embodiment of the present application;
  • Fig. 4 is a flowchart of a semantic truncation detection method provided by another embodiment of the present application;
  • Fig. 5 is a flowchart of a semantic truncation detection method provided by another embodiment of the present application;
  • Fig. 6 is a flowchart of a semantic truncation detection method provided by another embodiment of the present application;
  • Fig. 7 is a flowchart of a BERT classification model training method provided by another embodiment of the present application;
  • Fig. 8 is a schematic structural diagram of a semantic truncation detection apparatus provided by another embodiment of the present application;
  • Fig. 9 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • Figure 1 is a flowchart of a semantic truncation detection method provided by an embodiment of the present application, which includes but is not limited to steps S110 to S140:
  • Step S110: obtain text data to be detected;
  • the text data to be detected is converted from the user's voice data collected by an artificial intelligence-based voice device.
  • the voice device collects the voice data output by the user during the interaction process, then recognizes and converts the voice data, and generates corresponding text data, that is, the text data to be detected is obtained.
  • the voice device may be an electronic device that supports voice interaction, such as a smartphone, a smart appliance, or a smart watch.
  • the voice device also has the function of audio output, so as to realize human-machine voice interaction and meet the interactive use needs of users.
  • Step S120: acquire the first corpus data, and obtain multiple semantic truncation types according to the first corpus data, wherein the first corpus data is historical text data in which semantic truncation occurs;
  • the first corpus data is obtained by counting historical text data with semantic truncation in business applications, and based on the first corpus data, several common semantic truncation types are analyzed, that is, the types of sentences that are prone to truncation during the interaction process are analyzed, and multiple semantic truncation types are obtained.
  • Step S130: judge the semantic truncation type to which the text data to be detected belongs;
  • Step S140: according to the semantic truncation type, detect the text data to be detected through the preset rules and/or the pre-trained BERT classification model, and obtain the detection result of whether semantic truncation occurs in the text data to be detected;
  • after the text data to be detected is obtained from the user, it is compared with the multiple semantic truncation types to determine which truncation type it best fits, that is, the semantic truncation type to which the text data to be detected belongs. Based on the different semantic truncation types, different detection methods can be selected to provide interactive services in a more targeted manner; for example, the text data may be detected only by the preset rules or only by the BERT classification model, or by both in combination. The resulting detection result indicates whether semantic truncation occurs in the text data to be detected, which makes it easier for the customer service robot to identify the user's intention, effectively improves the user's interactive experience, and in turn reduces the demand for human agents, which can to some extent improve the efficiency of a call service center and reduce operating costs.
  • the preset rules are used to identify whether the user ends the current dialogue during the interaction process, that is, whether semantic truncation occurs.
  • the preset rules usually match query words against an established language database; the database used may contain common semantically truncated sentences, which makes it convenient to judge whether semantic truncation occurs in the user's output text data.
  • the preset rules include a variety of matching methods, for example: method 1, head query word matching, which uses exact text matching for the query words that are few in number but relatively concentrated in truncated sentences; method 2, special query word matching, which uses regular-expression matching for query words with special formats; and method 3, short-sentence query word matching, which annotates part-of-speech sequences for short query phrases that are difficult for a classification model to handle and uses part-of-speech sequence matching.
  • the BERT classification model is obtained through the following training steps:
  • Step S210: obtain business corpus data, wherein the business corpus data includes multiple pieces of business text data;
  • Step S220: select a random position in each piece of business text data for segmentation, and construct positive example sentence pairs, wherein a positive example sentence pair is a pair of preceding and following sentences with a truncation relationship;
  • Step S230: select any two pieces of business text data to construct a negative example sentence pair, wherein the negative example sentence pair is a pair of preceding and following sentences without a truncation relationship;
  • Step S240: construct a training set from the positive example sentence pairs and negative example sentence pairs, input the training set into the initial BERT model for training, and obtain the BERT classification model.
  • the Bidirectional Encoder Representations from Transformers (BERT) model is a deep bidirectional, unsupervised language representation model pretrained using only a plain-text corpus; the embodiment of the present application selects the BERT model as the classification model.
  • the model structure adopts the standard base version of BERT, that is, 12-layer, 768-hidden, 12-heads, 110M parameters.
  • by constructing positive and negative example sentence pairs, the initial BERT model can learn the truncation relationship between sentences.
  • a large amount of business corpus data accumulated in business applications is added for the initial BERT model; the business corpus data includes multiple pieces of business text data, and the pretraining task is designed around the characteristics of text truncation.
  • in the Next Sentence Prediction (NSP) part of the pretraining stage, two pieces of business text data are randomly selected to construct pairs of preceding and following sentences without a truncation relationship, that is, negative example sentence pairs.
  • a training set is constructed from the positive and negative example sentence pairs and input into the initial BERT model for training, so that the model predicts the truncation relationship between preceding and following sentences while pretraining; this finally yields a BERT classification model with better text representation, which helps improve the accuracy of detecting semantic truncation in text data and gives the customer service robot a stronger ability to judge whether the user has ended the current utterance.
  • according to the technical solution of the embodiments of the present application, the text data to be detected is obtained, the semantic truncation type to which it belongs is determined, and the text data is detected through preset rules and/or a pre-trained BERT classification model; selecting different detection methods for different semantic truncation types provides interactive services in a more targeted manner and helps improve responsiveness during the interaction.
  • in addition, a pretraining task designed around the characteristics of text truncation is used to train the initial BERT model, constructing pairs of preceding and following sentences with a truncation relationship from the business text data as positive example sentence pairs and pairs without a truncation relationship as negative example sentence pairs; training the model on the training set built from these pairs lets it learn truncation features better, which improves recognition performance, so that the customer service robot can identify the user's intention more accurately in complex real interaction situations, reduce the extra interactions caused by recognition failures, effectively improve service quality, and increase user satisfaction.
  • the multiple semantic truncation types include a first truncation type, a second truncation type, and a third truncation type, and the preset rules include a first matching dictionary, a second matching dictionary, and a third matching dictionary. In step S140, detecting the text data to be detected through preset rules and/or a pre-trained BERT classification model according to the semantic truncation type includes at least one of the following:
  • Step S1411: if the text data to be detected belongs to the first truncation type, match the text data against the first matching dictionary, wherein the first truncation type indicates the occurrence of modal particles;
  • Step S1421: if the text data to be detected belongs to the second truncation type, detect the text data according to the second matching dictionary and the BERT classification model, wherein the second truncation type indicates the occurrence of pause or interruption words;
  • Step S1431: if the text data to be detected belongs to the third truncation type, detect the text data according to the third matching dictionary and the BERT classification model, wherein the third truncation type indicates the occurrence of colloquial habitual words.
  • because of the diversity of users' speech and the complexity of business application scenarios, recognizing semantic truncation faces many challenges; by analyzing business text data from a variety of application scenarios, the embodiment of the present application summarizes three types of utterances that are prone to semantic truncation.
  • the first truncation type indicates the occurrence of modal particles, such as "ah", "uh", and "um".
  • the second truncation type indicates the occurrence of pause or interruption words. This type is usually semantic truncation caused by the user pausing to think or being interrupted while speaking, for example with phrases such as "I'd like to ask about", "want to check", and "excuse me".
  • the third truncation type indicates the occurrence of colloquial habitual words.
  • this type is usually semantic truncation caused by colloquial filler words appearing while the user speaks, for example words such as "this", "that", and "that is".
  • in one embodiment, the semantic truncation detection method of the embodiment of the present application performs the following steps:
  • Step S110: obtain the text data to be detected;
  • Step S120: obtain the first corpus data, and obtain multiple semantic truncation types according to the first corpus data;
  • Step S130: judge the semantic truncation type to which the text data to be detected belongs;
  • Step S141: if the text data to be detected belongs to the first truncation type, match the text data against the first matching dictionary, and obtain the detection result of whether semantic truncation occurs;
  • Step S142: if the text data to be detected belongs to the second truncation type, detect the text data according to the second matching dictionary and the BERT classification model, and obtain the detection result of whether semantic truncation occurs;
  • Step S143: if the text data to be detected belongs to the third truncation type, detect the text data according to the third matching dictionary and the BERT classification model, and obtain the detection result of whether semantic truncation occurs.
  • in step S140, obtaining the detection result of whether semantic truncation occurs in the text data to be detected includes:
  • Step S1412: if the text data to be detected matches a modal particle in the first matching dictionary, obtain the detection result that semantic truncation occurs in the text data to be detected.
  • since the first truncation type mostly appears in short sentences, the preset rules can be used for matching directly.
  • the preset rules are provided with a first matching dictionary.
  • the first matching dictionary includes multiple typical modal particles.
  • the text data to be detected is matched through the first matching dictionary; if a relevant modal particle is exactly matched in the text data, semantic truncation is detected, that is, the detection result obtained is that semantic truncation occurs in the text data to be detected.
  • matching the text data to be detected through the first matching dictionary may adopt an exact text matching method or a part-of-speech sequence matching method.
  • the second matching dictionary pre-stores multiple pause words and interruption words; in step S1421, detecting the text data to be detected according to the second matching dictionary and the BERT classification model includes but is not limited to steps S310 and S320:
  • Step S310: match the beginning and end of the text data to be detected against the second matching dictionary;
  • Step S320: if the text data to be detected does not match any word in the second matching dictionary, use the BERT classification model to detect it and output probability prediction scores, wherein the probability prediction scores include a truncation prediction score and a non-truncation prediction score;
  • in step S140, obtaining the detection result of whether semantic truncation occurs in the text data to be detected includes:
  • Step S1422: if the truncation prediction score is higher than or equal to a preset truncation threshold, obtain the detection result that semantic truncation occurs in the text data to be detected.
  • for the second truncation type, a combination of the preset rules and the BERT classification model can be used: pause words and interruption words that appear frequently in truncated sentences are collected statistically and pre-stored in the second matching dictionary. Since pause and interruption words commonly appear at the beginning and end of a sentence, in practical applications the beginning and end of the text data to be detected are first matched against the second matching dictionary. If no word in the second matching dictionary is matched, the BERT classification model is further used for detection: after the text data passes through the BERT classification model, probability prediction scores for the two categories (truncation and non-truncation) are output; a threshold judgment mechanism is designed, a preset truncation threshold is introduced, and the detection result is output by comparing the truncation prediction score with the preset truncation threshold.
  • if the truncation prediction score is higher than or equal to the preset truncation threshold, the detection result indicates that semantic truncation occurs; understandably, if the truncation prediction score is lower than the preset truncation threshold, the detection result indicates that the text data to be detected is not truncated.
  • adding the threshold judgment mechanism can effectively improve the recognition performance of the BERT classification model, so that it can judge more precisely whether the user has ended the current utterance and thus identify the user's intention quickly and accurately.
  • the preset truncation threshold can be set according to the actual situation; in the embodiment of the present application, different thresholds were tested, and setting the preset truncation threshold to 0.6 gave the best detection performance for the BERT classification model.
  • if the text data to be detected does match a word in the second matching dictionary, it can be directly determined that semantic truncation occurs in the text data to be detected.
  • when matching the text data to be detected through the second matching dictionary, special query word matching or short-sentence query word matching can be selected.
  • the third matching dictionary pre-stores multiple colloquial habitual words; in step S1431, detecting the text data to be detected according to the third matching dictionary and the BERT classification model includes but is not limited to steps S410 and S420:
  • Step S410: match the end of the text data to be detected against the third matching dictionary;
  • Step S420: if the text data to be detected does not match any word in the third matching dictionary, use the BERT classification model to detect it and output probability prediction scores, wherein the probability prediction scores include a truncation prediction score and a non-truncation prediction score;
  • in step S140, obtaining the detection result of whether semantic truncation occurs in the text data to be detected includes:
  • Step S1422: if the truncation prediction score is higher than or equal to the preset truncation threshold, obtain the detection result that semantic truncation occurs in the text data to be detected.
  • the third matching dictionary is built by collecting the colloquial habitual words that appear frequently in truncated sentences. Since colloquial habitual words commonly appear at the end of a sentence, in practical applications the end of the text data to be detected is first matched against the third matching dictionary; if no word in the third matching dictionary is matched, the BERT classification model is further used for detection, outputting the probability prediction scores for the two categories, and the detection result is output by comparing the truncation prediction score with the preset truncation threshold. If the truncation prediction score is higher than or equal to the preset truncation threshold, the detection result indicates that semantic truncation occurs.
  • if the text data to be detected does match a word in the third matching dictionary, it can be directly determined that semantic truncation occurs in the text data to be detected.
  • matching the text data to be detected through the third matching dictionary uses exact text matching and regular-expression matching, wherein the third matching dictionary includes an exact-match word dictionary and a special-format matching dictionary.
  • in step S120, obtaining the first corpus data and obtaining multiple semantic truncation types according to the first corpus data includes but is not limited to steps S510 to S530:
  • Step S510: obtain pre-labeled first corpus data;
  • Step S520: perform preprocessing and word segmentation on the first corpus data to obtain second corpus data;
  • Step S530: obtain multiple semantic truncation types according to preset semantic dimensions and the second corpus data, wherein the preset semantic dimensions include at least one of sentence length, first and last characters, sentence structure, part-of-speech order, and frequency distribution.
  • data is an important prerequisite for analysis, and accumulating the original corpus data is the first task.
  • a large amount of historical text data is acquired, and the data in which semantic truncation occurs is labeled.
  • in practical applications, one month's business data is selected and labeled by comparing the speech-to-text recognition results with manual transcriptions, yielding the first corpus data; the first corpus data is then preprocessed and segmented into words to obtain the second corpus data.
  • to ensure segmentation accuracy, a word segmentation dictionary continuously optimized on business data is used, which is better suited to business application scenarios.
  • the preset semantic dimensions include at least one of sentence length, first and last characters, sentence structure, part-of-speech order, and frequency distribution.
  • while both are semantic truncations, the types differ: the former is mostly a subjective pause by the user, typically expressed as a subject plus a verb, while the latter mostly appears as modal particles with no other content.
  • three semantic truncation types are obtained, namely the first truncation type, the second truncation type, and the third truncation type, wherein the first truncation type indicates the occurrence of modal particles, the second indicates the occurrence of pause or interruption words, and the third indicates the occurrence of colloquial habitual words.
  • the BERT classification model includes a fully connected layer and two Transformer layers.
  • in step S240, inputting the training set into the initial BERT model for training includes but is not limited to steps S610 to S640:
  • Step S610: input the data in the training set into the Transformer layers of the initial BERT model;
  • Step S620: input the output vector of the last Transformer layer into the fully connected layer, and output the probability prediction scores of the two categories, wherein the probability prediction scores include a truncation prediction score and a non-truncation prediction score;
  • Step S630: if the truncation prediction score is higher than or equal to the preset truncation threshold, output a prediction result indicating that semantic truncation occurs;
  • Step S640: train the initial BERT model according to the training set and the prediction results.
  • the embodiment of the present application modifies some of the Transformer units in the middle layers of the BERT classification model, reducing the twelve-layer Transformer structure of the initial BERT model to a two-layer Transformer structure; this greatly simplifies the model without substantially affecting its performance and correspondingly greatly reduces the number of model parameters.
  • in testing, the training speed of the entire model increased threefold.
  • streamlining the model structure greatly improves both the training speed and the prediction speed of the model, which helps meet enterprises' higher demands for rapid iteration of and responsiveness from business models.
  • the detailed process of inputting the training set into the initial BERT model for training is as follows: the preprocessed data in the training set is input into the initial BERT model.
  • the data passes through the Embedding layer to obtain the text representation and is then fed into the Transformer layers.
  • the output vector of the hidden state of the last Transformer layer is input into the fully connected layer, whose output is the probability prediction scores of the two categories, namely the truncation prediction score and the non-truncation prediction score.
  • a threshold judgment mechanism with a preset truncation threshold is designed: the prediction result is output by comparing the truncation prediction score with the preset truncation threshold, and if the truncation prediction score is higher than or equal to the preset truncation threshold, the prediction result indicates that semantic truncation occurs.
  • the initial BERT model is trained according to the training set and the prediction results to obtain a BERT classification model with good recognition performance.
  • the preset truncation threshold used by the BERT classification model during training is the same value as the preset truncation threshold used during the detection process described above; based on the results of multiple tests, the preset truncation threshold can be set to 0.6.
  • the semantic truncation detection method further includes:
  • if the detection result indicates that semantic truncation occurs in the text data to be detected, waiting a first preset time before performing the response operation;
  • if the detection result indicates that no semantic truncation occurs in the text data to be detected, performing the response operation directly.
  • the embodiments of the present application may acquire and process relevant data based on artificial intelligence technology.
  • artificial intelligence is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • the semantic truncation detection method of the embodiment of the present application can be applied to an intelligent customer service system.
  • the user's voice response is usually converted into a response text, and the response text is input into the man-machine dialogue system for recognition.
  • the customer service robot provides voice interactive services for the user, such as after-sales problem consultation, operation guidance service, etc.
  • Fig. 8 is a schematic structural diagram of a semantic truncation detection apparatus provided by an embodiment of the present application.
  • the semantic truncation detection apparatus 800 of the embodiment of the present application includes but is not limited to a first acquisition module 810, a second acquisition module 820, a judging module 830, a detection module 840, a third acquisition module 850, a positive example construction module 860, a negative example construction module 870, and a training module 880.
  • the first acquisition module 810 is used to acquire the text data to be detected;
  • the second acquisition module 820 is used to acquire the first corpus data, and obtain multiple semantic truncation types according to the first corpus data, wherein the first corpus data is historical text data in which semantic truncation occurs;
  • the judging module 830 is used to judge the semantic truncation type to which the text data to be detected belongs;
  • the detection module 840 is used to detect the text data to be detected through preset rules and/or a pre-trained BERT classification model according to the semantic truncation type, and obtain the detection result of whether semantic truncation occurs in the text data to be detected;
  • the third acquisition module 850 is used to acquire business corpus data, wherein the business corpus data includes multiple pieces of business text data;
  • the positive example construction module 860 is used to select a random position in each piece of business text data for segmentation and construct positive example sentence pairs, wherein a positive example sentence pair is a pair of preceding and following sentences with a truncation relationship; the negative example construction module 870 is used to select any two pieces of business text data and construct negative example sentence pairs, wherein a negative example sentence pair is a pair of preceding and following sentences without a truncation relationship; and the training module 880 is used to construct a training set from the positive and negative example sentence pairs and input it into the initial BERT model for training to obtain the BERT classification model.
  • according to the technical solution of the embodiments of the present application, the text data to be detected is obtained, the semantic truncation type to which it belongs is determined, and the text data is detected through preset rules and/or a pre-trained BERT classification model; selecting different detection methods for different semantic truncation types provides interactive services in a more targeted manner and helps improve responsiveness during the interaction.
  • in addition, a pretraining task designed around the characteristics of text truncation is used to train the initial BERT model, constructing pairs of preceding and following sentences with a truncation relationship from the business text data as positive example sentence pairs and pairs without a truncation relationship as negative example sentence pairs; training the model on the training set built from these pairs lets it learn truncation features better, which improves recognition performance, so that the customer service robot can identify the user's intention more accurately in complex real interaction situations, reduce the extra interactions caused by recognition failures, effectively improve service quality, and increase user satisfaction.
  • the multiple semantic truncation types include the first truncation type, the second truncation type, and the third truncation type;
  • the preset rules include the first matching dictionary, the second matching dictionary, and the third matching dictionary.
  • in the detection module, detecting the text data to be detected through preset rules and/or a pre-trained BERT classification model according to the semantic truncation type includes at least one of the following:
  • if the text data to be detected belongs to the first truncation type, matching the text data to be detected against the first matching dictionary, wherein the first truncation type indicates the occurrence of modal particles;
  • if the text data to be detected belongs to the second truncation type, detecting the text data to be detected according to the second matching dictionary and the BERT classification model, wherein the second truncation type indicates the occurrence of pause or interruption words;
  • if the text data to be detected belongs to the third truncation type, detecting the text data to be detected according to the third matching dictionary and the BERT classification model, wherein the third truncation type indicates the occurrence of colloquial habitual words.
  • the first matching dictionary pre-stores a plurality of modal particles; in the detection module, obtaining the detection result of whether semantic truncation occurs in the text data to be detected specifically includes: if the text data to be detected matches a modal particle in the first matching dictionary, obtaining the detection result that semantic truncation occurs in the text data to be detected.
  • the second matching dictionary pre-stores a plurality of pause words and interruption words;
  • in the detection module, detecting the text data to be detected according to the second matching dictionary and the BERT classification model specifically includes: matching the beginning and end of the text data to be detected against the second matching dictionary;
  • if the text data to be detected does not match any word in the second matching dictionary, using the BERT classification model to detect it and output probability prediction scores, wherein the probability prediction scores include a truncation prediction score and a non-truncation prediction score.
  • in the detection module, obtaining the detection result of whether semantic truncation occurs in the text data to be detected includes:
  • if the truncation prediction score is higher than or equal to the preset truncation threshold, obtaining the detection result that semantic truncation occurs in the text data to be detected.
  • the third matching dictionary pre-stores a plurality of colloquial habitual words; in the detection module, detecting the text data to be detected according to the third matching dictionary and the BERT classification model specifically includes: matching the end of the text data to be detected against the third matching dictionary;
  • if the text data to be detected does not match any word in the third matching dictionary, using the BERT classification model to detect it and output probability prediction scores, wherein the probability prediction scores include a truncation prediction score and a non-truncation prediction score.
  • in the detection module, obtaining the detection result of whether semantic truncation occurs in the text data to be detected includes:
  • if the truncation prediction score is higher than or equal to the preset truncation threshold, obtaining the detection result that semantic truncation occurs in the text data to be detected.
  • the second acquisition module is specifically used to: obtain the pre-labeled first corpus data;
  • perform preprocessing and word segmentation on the first corpus data to obtain the second corpus data;
  • obtain multiple semantic truncation types according to the preset semantic dimensions and the second corpus data, wherein the preset semantic dimensions include at least one of sentence length, first and last characters, sentence structure, part-of-speech order, and frequency distribution.
  • the BERT classification model includes a fully connected layer and two Transformer layers.
  • in the training module, inputting the training set into the initial BERT model for training specifically includes: inputting the data in the training set into the Transformer layers of the initial BERT model; inputting the output vector of the last Transformer layer into the fully connected layer and outputting the probability prediction scores of the two categories; if the truncation prediction score is higher than or equal to the preset truncation threshold, outputting a prediction result indicating that semantic truncation occurs; and training the initial BERT model according to the training set and the prediction results.
  • the semantic truncation detection apparatus further includes a first execution module and a second execution module: the first execution module is used to wait a first preset time before performing the response operation when the detection result indicates that semantic truncation occurs in the text data to be detected, and the second execution module is used to perform the response operation directly when the detection result indicates that no semantic truncation occurs.
  • an embodiment of the present application also provides a computer device 900 .
  • the computer device 900 includes: a memory 910 , a processor 920 and a computer program stored in the memory 910 and operable on the processor 920 .
  • the processor 920 and the memory 910 may be connected through a bus or in other ways.
  • the memory 910 as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs.
  • the memory 910 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices.
  • the memory 910 may optionally include memory located remotely from the processor 920, and these remote memories may be connected to the processor 920 via a network.
  • Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the computer device 900 shown in Fig. 9 does not constitute a limitation on the embodiments of the present application; it may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
  • the non-transitory software programs and instructions required to implement the semantic truncation detection method of the above embodiments are stored in the memory 910; when executed by the processor 920, they perform a semantic truncation detection method, wherein the semantic truncation detection method includes: obtaining text data to be detected; obtaining first corpus data, and obtaining multiple semantic truncation types according to the first corpus data, wherein the first corpus data is historical text data in which semantic truncation occurs; judging the semantic truncation type to which the text data to be detected belongs;
  • according to the semantic truncation type, detecting the text data to be detected through preset rules and/or a pre-trained BERT classification model, and obtaining the detection result of whether semantic truncation occurs in the text data to be detected; wherein the BERT classification model is obtained through the following training steps: obtaining business corpus data, wherein the business corpus data includes multiple pieces of business text data; selecting a random position in each piece of business text data for segmentation and constructing positive example sentence pairs, wherein a positive example sentence pair is a pair of preceding and following sentences with a truncation relationship; selecting any two pieces of business text data and constructing negative example sentence pairs, wherein a negative example sentence pair is a pair of preceding and following sentences without a truncation relationship; and constructing a training set from the positive and negative example sentence pairs and inputting it into the initial BERT model for training to obtain the BERT classification model.
  • an embodiment of the present application also provides a computer-readable storage medium
  • the computer-readable storage medium may be non-volatile or volatile
  • the computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are used to execute the above semantic truncation detection method.
  • the semantic truncation detection method includes: obtaining text data to be detected; obtaining first corpus data, and obtaining multiple semantic truncation types according to the first corpus data, wherein the first corpus data is historical text data in which semantic truncation occurs; judging the semantic truncation type to which the text data to be detected belongs; according to the semantic truncation type, detecting the text data to be detected through preset rules and/or a pre-trained BERT classification model, and obtaining a detection result of whether semantic truncation occurs in the text data to be detected; wherein the BERT classification model is obtained through the following training steps: obtaining business corpus data, wherein the business corpus data includes multiple pieces of business text data; selecting a random position in each piece of business text data for segmentation and constructing positive example sentence pairs, wherein a positive example sentence pair is a pair of preceding and following sentences with a truncation relationship; selecting any two pieces of business text data and constructing negative example sentence pairs, wherein a negative example sentence pair is a pair of preceding and following sentences without a truncation relationship; and constructing a training set from the positive and negative example sentence pairs and inputting the training set into the initial BERT model for training to obtain the BERT classification model.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and that can be accessed by a computer.
  • communication media typically embody computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media, as are known to those of ordinary skill in the art.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The present application provides a semantic truncation detection method, apparatus, device, and computer-readable storage medium. The semantic truncation detection method includes: obtaining text data to be detected; obtaining first corpus data and deriving multiple semantic truncation types from it; judging the semantic truncation type to which the text data to be detected belongs; and, according to the semantic truncation type, detecting the text data through preset rules and/or a BERT classification model to obtain a detection result. The BERT classification model is obtained through the following steps: obtaining business corpus data; selecting a random position in each piece of business text data for segmentation to construct positive example sentence pairs; selecting any two pieces of business text data to construct negative example sentence pairs; and constructing a training set from the positive and negative example sentence pairs and inputting it into an initial BERT model for training to obtain the BERT classification model. The method can identify the user's intention more accurately, reduce the extra interactions caused by recognition failures, and improve the user's experience.

Description

Semantic truncation detection method, apparatus, device, and computer-readable storage medium
This application claims priority to Chinese patent application No. 202210057008.6, filed with the Chinese Patent Office on January 18, 2022 and entitled "Semantic truncation detection method, apparatus, device, and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the fields of artificial intelligence and natural language processing, and in particular to a semantic truncation detection method, apparatus, device, and computer-readable storage medium.
Background
With the arrival of the Internet era, artificial intelligence has gradually been applied in all walks of life. The terminal devices people use have shifted from the traditional personal computer (PC), television, and telephone to smartphones, smart wearables, and similar devices, and network information has become shared, personalized, real-time, and big-data-driven. As people pursue a higher quality of life, they also place higher demands on services, and whether problems encountered in daily life can be solved promptly and accurately is an important criterion by which people judge the quality of a service. Intelligent customer service can be online 24 hours a day to solve problems for different users simultaneously, can efficiently meet users' needs, and can greatly reduce the cost of human customer service.
Technical Problem
The following is a technical problem of the prior art recognized by the inventor: in current intelligent customer service systems, the general interaction flow is that the user finishes stating a request, and the intelligent customer service robot then recognizes the received voice information and provides the corresponding service. However, because of factors such as the diversity of users' speaking habits and the complexity of real application scenarios, it often happens during actual interaction that the user pauses after saying a few words and, just as the user is about to continue, the customer service robot has already begun to reply. The user's intention cannot then be correctly recognized, the number of interactions between the user and the robot increases, and the user easily has a poor experience. However, if the customer service robot's waiting time is extended, the time the user must wait for the robot's feedback after finishing speaking increases accordingly, which likewise gives the user a poor experience and reduces user satisfaction.
Technical Solution
In a first aspect, an embodiment of the present application provides a semantic truncation detection method, including:
obtaining text data to be detected;
obtaining first corpus data, and obtaining multiple semantic truncation types according to the first corpus data, wherein the first corpus data is historical text data in which semantic truncation occurs;
judging the semantic truncation type to which the text data to be detected belongs;
according to the semantic truncation type, detecting the text data to be detected through preset rules and/or a pre-trained BERT classification model, and obtaining a detection result of whether semantic truncation occurs in the text data to be detected;
wherein the BERT classification model is obtained through the following training steps:
obtaining business corpus data, wherein the business corpus data includes multiple pieces of business text data;
selecting a random position in each piece of the business text data for segmentation, and constructing positive example sentence pairs, wherein a positive example sentence pair is a pair of preceding and following sentences with a truncation relationship;
selecting any two pieces of the business text data, and constructing negative example sentence pairs, wherein a negative example sentence pair is a pair of preceding and following sentences without a truncation relationship;
constructing a training set from the positive example sentence pairs and the negative example sentence pairs, and inputting the training set into an initial BERT model for training to obtain the BERT classification model.
In a second aspect, an embodiment of the present application further provides a semantic truncation detection apparatus, including:
a first acquisition module, used to acquire text data to be detected;
a second acquisition module, used to acquire first corpus data and obtain multiple semantic truncation types according to the first corpus data, wherein the first corpus data is historical text data in which semantic truncation occurs;
a judging module, used to judge the semantic truncation type to which the text data to be detected belongs;
a detection module, used to detect the text data to be detected through preset rules and/or a pre-trained BERT classification model according to the semantic truncation type, and obtain a detection result of whether semantic truncation occurs in the text data to be detected;
a third acquisition module, used to acquire business corpus data, wherein the business corpus data includes multiple pieces of business text data;
a positive example construction module, used to select a random position in each piece of the business text data for segmentation and construct positive example sentence pairs, wherein a positive example sentence pair is a pair of preceding and following sentences with a truncation relationship;
a negative example construction module, used to select any two pieces of the business text data and construct negative example sentence pairs, wherein a negative example sentence pair is a pair of preceding and following sentences without a truncation relationship;
a training module, used to construct a training set from the positive example sentence pairs and the negative example sentence pairs, and input the training set into an initial BERT model for training to obtain the BERT classification model.
In a third aspect, an embodiment of the present application further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing a semantic truncation detection method when executing the computer program, wherein the semantic truncation detection method includes: obtaining text data to be detected; obtaining first corpus data, and obtaining multiple semantic truncation types according to the first corpus data, wherein the first corpus data is historical text data in which semantic truncation occurs; judging the semantic truncation type to which the text data to be detected belongs; according to the semantic truncation type, detecting the text data to be detected through preset rules and/or a pre-trained BERT classification model, and obtaining a detection result of whether semantic truncation occurs in the text data to be detected; wherein the BERT classification model is obtained through the following training steps: obtaining business corpus data, wherein the business corpus data includes multiple pieces of business text data; selecting a random position in each piece of the business text data for segmentation and constructing positive example sentence pairs, wherein a positive example sentence pair is a pair of preceding and following sentences with a truncation relationship; selecting any two pieces of the business text data and constructing negative example sentence pairs, wherein a negative example sentence pair is a pair of preceding and following sentences without a truncation relationship; and constructing a training set from the positive example sentence pairs and the negative example sentence pairs, and inputting the training set into the initial BERT model for training to obtain the BERT classification model.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute a semantic truncation detection method, wherein the semantic truncation detection method includes: obtaining text data to be detected; obtaining first corpus data, and obtaining multiple semantic truncation types according to the first corpus data, wherein the first corpus data is historical text data in which semantic truncation occurs; judging the semantic truncation type to which the text data to be detected belongs; according to the semantic truncation type, detecting the text data to be detected through preset rules and/or a pre-trained BERT classification model, and obtaining a detection result of whether semantic truncation occurs in the text data to be detected; wherein the BERT classification model is obtained through the following training steps: obtaining business corpus data, wherein the business corpus data includes multiple pieces of business text data; selecting a random position in each piece of the business text data for segmentation and constructing positive example sentence pairs, wherein a positive example sentence pair is a pair of preceding and following sentences with a truncation relationship; selecting any two pieces of the business text data and constructing negative example sentence pairs, wherein a negative example sentence pair is a pair of preceding and following sentences without a truncation relationship; and constructing a training set from the positive example sentence pairs and the negative example sentence pairs, and inputting the training set into the initial BERT model for training to obtain the BERT classification model.
Beneficial Effects
The semantic truncation detection method, apparatus, device, and computer-readable storage medium proposed in the embodiments of the present application obtain the text data to be detected, judge the semantic truncation type to which it belongs, and detect it through preset rules and/or a pre-trained BERT classification model; selecting different detection methods for different semantic truncation types provides interactive services to users in a more targeted manner and helps improve responsiveness during the interaction. In addition, a pretraining task designed around the characteristics of text truncation is used to train the initial BERT model: pairs of preceding and following sentences with a truncation relationship are constructed from the business text data as positive example sentence pairs, pairs of preceding and following sentences without a truncation relationship are constructed as negative example sentence pairs, and the model is trained on the training set built from these pairs. This lets the model better learn truncation features, which improves its recognition performance, so that the customer service robot, when faced with complex real interaction situations, can identify the user's intention more accurately, reduce the extra interactions between the user and the robot caused by recognition failures, effectively improve service quality, and increase user satisfaction.
Brief Description of the Drawings
The accompanying drawings are provided for a further understanding of the technical solution of the present application and constitute a part of the specification; together with the embodiments of the present application, they serve to explain the technical solution of the present application and do not constitute a limitation on it.
Fig. 1 is a flowchart of a semantic truncation detection method provided by an embodiment of the present application;
Fig. 2 is a flowchart of a BERT classification model training method provided by an embodiment of the present application;
Fig. 3 is a flowchart of a semantic truncation detection method provided by another embodiment of the present application;
Fig. 4 is a flowchart of a semantic truncation detection method provided by another embodiment of the present application;
Fig. 5 is a flowchart of a semantic truncation detection method provided by another embodiment of the present application;
Fig. 6 is a flowchart of a semantic truncation detection method provided by another embodiment of the present application;
Fig. 7 is a flowchart of a BERT classification model training method provided by another embodiment of the present application;
Fig. 8 is a schematic structural diagram of a semantic truncation detection apparatus provided by another embodiment of the present application;
Fig. 9 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
Detailed Description of Embodiments
To make the purpose, technical solution, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application and are not intended to limit it.
It should be noted that although functional modules are divided in the apparatus schematic diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a module division different from that in the apparatus, or in an order different from that in the flowcharts. The terms "first", "second", and the like in the specification, claims, and drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. Furthermore, the terms "include" or "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or device.
The term "and/or" used herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate three cases: A alone, both A and B, and B alone. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects before and after it.
The embodiments of the present application are further described below with reference to the accompanying drawings.
As shown in Fig. 1 and Fig. 2, Fig. 1 is a flowchart of a semantic truncation detection method provided by an embodiment of the present application; the method includes but is not limited to steps S110 to S140.
Step S110: obtain text data to be detected.
It should be noted that the text data to be detected is converted from the user's voice data collected by an artificial-intelligence-based voice device: the voice device collects the voice data output by the user during the interaction, recognizes and converts the voice data, and generates the corresponding text data, that is, the text data to be detected. The voice device may be an electronic device that supports voice interaction, such as a smartphone, a smart appliance, or a smart watch; the voice device also has an audio output function, so that human-machine voice interaction can be realized and the user's interaction needs met.
Step S120: obtain first corpus data, and obtain multiple semantic truncation types according to the first corpus data, wherein the first corpus data is historical text data in which semantic truncation occurs.
It should be noted that the first corpus data is obtained by collecting the historical text data in which semantic truncation occurred in business applications, and several common semantic truncation types are derived from the first corpus data; that is, the types of utterances prone to truncation during interaction are analyzed, yielding multiple semantic truncation types.
Step S130: judge the semantic truncation type to which the text data to be detected belongs.
Step S140: according to the semantic truncation type, detect the text data to be detected through preset rules and/or a pre-trained BERT classification model, and obtain a detection result of whether semantic truncation occurs in the text data to be detected.
After the text data to be detected is obtained from the user, it is compared with the multiple semantic truncation types to determine which truncation type it best fits, that is, the semantic truncation type to which it belongs. Based on the different semantic truncation types, different detection methods can be selected to provide interactive services in a more targeted manner; for example, the text data may be detected only by the preset rules or only by the BERT classification model, or by the preset rules and the BERT classification model in combination, so as to obtain a detection result. The detection result indicates whether semantic truncation occurs in the text data to be detected, which makes it easier for the customer service robot to identify the user's intention, effectively improves the user's interactive experience, reduces the demand for human agents, and can to some extent improve the efficiency of a call service center and reduce operating costs.
It should be noted that the preset rules are used to recognize whether the user has ended the current utterance during the interaction, that is, whether semantic truncation occurs. The preset rules usually match query words against an established language database, and the database used may contain common semantically truncated sentences, making it convenient to judge whether semantic truncation occurs in the user's output text data. The preset rules include a variety of matching methods, for example: method 1, head query word matching, which uses exact text matching for the query words that are few in number but relatively concentrated in truncated sentences; method 2, special query word matching, which uses regular-expression matching for query words with special formats; and method 3, short-sentence query word matching, which annotates part-of-speech sequences for short query phrases that are difficult for a classification model to handle and uses part-of-speech sequence matching.
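As an illustration only (the patent provides no code), the following is a minimal Python sketch of the three matching methods described above; the dictionary contents, the function names, and the placeholder part-of-speech tagger are all hypothetical stand-ins:

```python
import re

# Hypothetical example dictionaries; a real system would load these from
# business-specific matching dictionaries built from truncated sentences.
HEAD_QUERY_WORDS = {"excuse me", "may i ask"}                 # method 1: exact head words
SPECIAL_PATTERNS = [re.compile(r"^i want to (check|ask)\b")]  # method 2: regex patterns
POS_SEQUENCES = {("PRON", "VERB")}                            # method 3: POS sequences

def pos_tag(text: str) -> tuple:
    """Placeholder POS tagger; a real system would use a trained tagger."""
    return tuple("VERB" if w.endswith("check") else "PRON" for w in text.split())

def preset_rules_match(text: str) -> bool:
    """Return True if any preset rule flags the text as semantically truncated."""
    lowered = text.strip().lower()
    # Method 1: exact text matching of head query words.
    if any(lowered.startswith(w) for w in HEAD_QUERY_WORDS):
        return True
    # Method 2: regular-expression matching of special-format query words.
    if any(p.search(lowered) for p in SPECIAL_PATTERNS):
        return True
    # Method 3: part-of-speech sequence matching for short query phrases.
    if len(lowered.split()) <= 3 and pos_tag(lowered) in POS_SEQUENCES:
        return True
    return False

print(preset_rules_match("I want to check"))  # True under these toy rules
```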
As shown in Fig. 2, the BERT classification model is obtained through the following training steps.
Step S210: obtain business corpus data, wherein the business corpus data includes multiple pieces of business text data.
Step S220: select a random position in each piece of business text data for segmentation, and construct positive example sentence pairs, wherein a positive example sentence pair is a pair of preceding and following sentences with a truncation relationship.
Step S230: select any two pieces of business text data, and construct negative example sentence pairs, wherein a negative example sentence pair is a pair of preceding and following sentences without a truncation relationship.
Step S240: construct a training set from the positive example sentence pairs and the negative example sentence pairs, input the training set into the initial BERT model for training, and obtain the BERT classification model.
It should be noted that the Bidirectional Encoder Representations from Transformers (BERT) model is a deep bidirectional, unsupervised language representation model
pretrained using only a plain-text corpus. The embodiment of the present application selects the BERT model as the classification model, and the model structure adopts the standard base version of BERT, that is, 12-layer, 768-hidden, 12-heads, 110M parameters. By constructing positive and negative example sentence pairs, the initial BERT model can learn the truncation relationship between sentences. For the initial BERT model, a large amount of business corpus data accumulated in business applications is added; the business corpus data includes multiple pieces of business text data, and the pretraining task is designed around the characteristics of text truncation. In the Next Sentence Prediction (NSP) part of the pretraining stage, a random position is selected in each piece of business text data for segmentation, constructing pairs of preceding and following sentences with a truncation relationship, that is, positive example sentence pairs; at the same time, two pieces of business text data are randomly selected to construct pairs of preceding and following sentences without a truncation relationship, that is, negative example sentence pairs. A training set is constructed from the positive and negative example sentence pairs and input into the initial BERT model for training, so that the model predicts the truncation relationship between preceding and following sentences while pretraining. This finally yields a BERT classification model with better text representation, which helps improve the accuracy of detecting semantic truncation in text data and gives the customer service robot a stronger ability to judge whether the user has ended the current utterance.
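A minimal sketch of the pair-construction procedure in steps S220 and S230, assuming a simple character-level split; this is not code from the patent, and the corpus and function names are hypothetical:

```python
import random

def build_sentence_pairs(business_texts: list[str]) -> list[tuple[str, str, int]]:
    """Build NSP-style training pairs: label 1 = truncation relationship,
    label 0 = no truncation relationship."""
    pairs = []
    # Positive pairs: split each text at a random interior position, so the
    # two halves form preceding/following sentences with a truncation relation.
    for text in business_texts:
        if len(text) < 2:
            continue
        cut = random.randint(1, len(text) - 1)
        pairs.append((text[:cut], text[cut:], 1))
    # Negative pairs: two unrelated texts form a non-truncated pair.
    for _ in range(len(business_texts)):
        a, b = random.sample(business_texts, 2)
        pairs.append((a, b, 0))
    return pairs

corpus = ["i would like to check my account balance",
          "please transfer me to an agent",
          "what are your business hours"]
for left, right, label in build_sentence_pairs(corpus)[:3]:
    print(label, "|", left, "||", right)
```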
According to the technical solution of the embodiments of the present application, the text data to be detected is obtained, the semantic truncation type to which it belongs is judged, and the text data is detected through preset rules and/or a pre-trained BERT classification model; selecting different detection methods for different semantic truncation types provides interactive services to users in a more targeted manner and helps improve responsiveness during the interaction. In addition, a pretraining task designed around the characteristics of text truncation is used to train the initial BERT model: pairs of preceding and following sentences with a truncation relationship are constructed from the business text data as positive example sentence pairs, pairs without a truncation relationship are constructed as negative example sentence pairs, and the model is trained on the training set built from these pairs. This lets the model better learn truncation features, improves its recognition performance, and enables the customer service robot, when faced with complex real interaction situations, to identify the user's intention more accurately, reduce the extra interactions between the user and the robot caused by recognition failures, effectively improve service quality, and increase user satisfaction.
Based on the semantic truncation detection method of Fig. 1, the multiple semantic truncation types include a first truncation type, a second truncation type, and a third truncation type, and the preset rules include a first matching dictionary, a second matching dictionary, and a third matching dictionary. In step S140, detecting the text data to be detected through preset rules and/or a pre-trained BERT classification model according to the semantic truncation type includes at least one of the following:
Step S1411: if the text data to be detected belongs to the first truncation type, match the text data against the first matching dictionary, wherein the first truncation type indicates the occurrence of modal particles;
Step S1421: if the text data to be detected belongs to the second truncation type, detect the text data according to the second matching dictionary and the BERT classification model, wherein the second truncation type indicates the occurrence of pause or interruption words;
Step S1431: if the text data to be detected belongs to the third truncation type, detect the text data according to the third matching dictionary and the BERT classification model, wherein the third truncation type indicates the occurrence of colloquial habitual words.
Because of factors such as the diversity of users' speech and the complexity of business application scenarios, recognizing semantic truncation faces many challenges. By analyzing business text data from many different application scenarios, the embodiment of the present application summarizes three types prone to causing semantic truncation. The first truncation type indicates the occurrence of modal particles, such as "ah", "uh", and "um"; this type mostly appears in short sentences, so the preset rules can be used directly for detection, matching the text data against the first matching dictionary to obtain the detection result. The second truncation type indicates the occurrence of pause or interruption words; this is usually semantic truncation caused by the user pausing to think or being interrupted while speaking, for example with phrases such as "I'd like to ask about", "want to check", and "excuse me"; for this type, a combination of the preset rules and the BERT classification model can be used, detecting the text data according to the second matching dictionary and the BERT classification model. The third truncation type indicates the occurrence of colloquial habitual words; this is usually semantic truncation caused by colloquial filler words while the user speaks, for example words such as "this", "that", and "that is"; for this type, the combination of the preset rules and the BERT classification model is likewise used, detecting the text data according to the third matching dictionary and the BERT classification model. Selecting the subsequent detection strategy according to the truncation type takes full account of the diversity of the collected text data, facilitates targeted judgment, and greatly improves the efficiency of semantic truncation recognition.
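The type-dependent routing can be pictured with a short dispatch sketch (illustrative only; the type tags and the `match_dict` and `bert_truncation_score` callbacks are hypothetical names, not the patent's API):

```python
FIRST, SECOND, THIRD = "modal", "pause", "colloquial"  # hypothetical type tags
THRESHOLD = 0.6  # preset truncation threshold reported in the embodiments

def detect(text: str, trunc_type: str,
           match_dict, bert_truncation_score) -> bool:
    """Return True if semantic truncation is detected in the text."""
    if trunc_type == FIRST:
        # First type: rule matching against the modal-particle dictionary only.
        return match_dict(text, FIRST)
    # Second type matches sentence start and end; third type matches the end.
    if match_dict(text, trunc_type):
        return True
    # Fall back to the BERT classifier and the threshold judgment mechanism.
    return bert_truncation_score(text) >= THRESHOLD
```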
As shown in Fig. 3, in one embodiment, the semantic truncation detection method of the embodiment of the present application performs the following steps.
Step S110: obtain text data to be detected.
Step S120: obtain first corpus data, and obtain multiple semantic truncation types according to the first corpus data.
Step S130: judge the semantic truncation type to which the text data to be detected belongs.
Step S141: if the text data to be detected belongs to the first truncation type, match the text data against the first matching dictionary, and obtain the detection result of whether semantic truncation occurs.
Step S142: if the text data to be detected belongs to the second truncation type, detect the text data according to the second matching dictionary and the BERT classification model, and obtain the detection result of whether semantic truncation occurs.
Step S143: if the text data to be detected belongs to the third truncation type, detect the text data according to the third matching dictionary and the BERT classification model, and obtain the detection result of whether semantic truncation occurs.
In the above semantic truncation detection method, the first matching dictionary pre-stores multiple modal particles; in step S140, obtaining the detection result of whether semantic truncation occurs in the text data to be detected includes:
Step S1412: if the text data to be detected matches a modal particle in the first matching dictionary, obtain the detection result that semantic truncation occurs in the text data to be detected.
Since the first truncation type mostly appears in short sentences, the preset rules can be used for matching. The preset rules are provided with a first matching dictionary that includes multiple typical modal particles. In practical applications, the text data to be detected is matched through the first matching dictionary; if a relevant modal particle is exactly matched in the text data, semantic truncation is detected, that is, the detection result is that semantic truncation occurs in the text data to be detected. It should be noted that matching through the first matching dictionary may adopt an exact text matching method or a part-of-speech sequence matching method.
As shown in Fig. 4, in the above semantic truncation detection method, the second matching dictionary pre-stores multiple pause words and interruption words; in step S1421, detecting the text data to be detected according to the second matching dictionary and the BERT classification model includes but is not limited to steps S310 and S320.
Step S310: match the beginning and end of the text data to be detected against the second matching dictionary.
Step S320: if the text data to be detected does not match any word in the second matching dictionary, use the BERT classification model to detect it and output probability prediction scores, wherein the probability prediction scores include a truncation prediction score and a non-truncation prediction score.
In step S140, obtaining the detection result of whether semantic truncation occurs in the text data to be detected includes:
Step S1422: if the truncation prediction score is higher than or equal to a preset truncation threshold, obtain the detection result that semantic truncation occurs in the text data to be detected.
For the second truncation type, a combination of the preset rules and the BERT classification model can be used. Pause words and interruption words that appear frequently in truncated sentences are collected statistically and pre-stored in the second matching dictionary. Since pause and interruption words commonly appear at the beginning and end of a sentence, in practical applications the beginning and end of the text data to be detected are first matched against the second matching dictionary; if no word in the second matching dictionary is matched, the BERT classification model is further used for detection. After the text data passes through the BERT classification model, probability prediction scores for the two categories, truncation and non-truncation, are output; a threshold judgment mechanism is designed, a preset truncation threshold is introduced, and the detection result is output by comparing the truncation prediction score with the preset truncation threshold. If the truncation prediction score is higher than or equal to the preset truncation threshold, the detection result indicates that semantic truncation occurs; understandably, if the truncation prediction score is lower than the preset truncation threshold, the detection result indicates that the text data to be detected is not truncated.
Adding the threshold judgment mechanism can effectively improve the recognition performance of the BERT classification model, so that it can judge more precisely whether the user has ended the current utterance and thus identify the user's intention quickly and accurately. It should be noted that the preset truncation threshold can be set according to the actual situation; in the embodiment of the present application, different thresholds were tested, and setting the preset truncation threshold to 0.6 gave the best detection performance for the BERT classification model.
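A minimal sketch of the threshold judgment mechanism, assuming a two-logit classifier head whose softmax gives the truncation and non-truncation probability scores (the logit layout and function name are stand-ins, not the patent's code):

```python
import math

def threshold_decision(logits: tuple[float, float], threshold: float = 0.6) -> bool:
    """logits = (non_truncation_logit, truncation_logit); return True when the
    softmax truncation prediction score reaches the preset truncation threshold."""
    exp = [math.exp(x) for x in logits]
    total = sum(exp)
    trunc_score = exp[1] / total  # probability prediction score for "truncation"
    return trunc_score >= threshold

print(threshold_decision((0.2, 1.5)))  # True: softmax truncation score is about 0.79
```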
It should be noted that if the text data to be detected does match a word in the second matching dictionary, it can be directly determined that semantic truncation occurs. When matching the text data through the second matching dictionary, special query word matching or short-sentence query word matching can be selected.
As shown in Fig. 5, in the above semantic truncation detection method, the third matching dictionary pre-stores multiple colloquial habitual words; in step S1431, detecting the text data to be detected according to the third matching dictionary and the BERT classification model includes but is not limited to steps S410 and S420.
Step S410: match the end of the text data to be detected against the third matching dictionary.
Step S420: if the text data to be detected does not match any word in the third matching dictionary, use the BERT classification model to detect it and output probability prediction scores, wherein the probability prediction scores include a truncation prediction score and a non-truncation prediction score.
In step S140, obtaining the detection result of whether semantic truncation occurs in the text data to be detected includes:
Step S1422: if the truncation prediction score is higher than or equal to the preset truncation threshold, obtain the detection result that semantic truncation occurs in the text data to be detected.
For the third truncation type, a combination of the preset rules and the BERT classification model is used, similarly to the embodiment for the second truncation type: the third matching dictionary is built by collecting the colloquial habitual words that appear frequently in truncated sentences. Since colloquial habitual words commonly appear at the end of a sentence, in practical applications the end of the text data to be detected is first matched against the third matching dictionary; if no word in the third matching dictionary is matched, the BERT classification model is further used for detection. After the text data passes through the BERT classification model, the probability prediction scores for the truncation and non-truncation categories are output, and the detection result is output by comparing the truncation prediction score with the preset truncation threshold; if the truncation prediction score is higher than or equal to the preset truncation threshold, the detection result indicates that semantic truncation occurs.
It should be noted that if the text data to be detected does match a word in the third matching dictionary, it can be directly determined that semantic truncation occurs. Matching through the third matching dictionary uses exact text matching and regular-expression matching, wherein the third matching dictionary includes an exact-match word dictionary and a special-format matching dictionary.
As shown in Fig. 6, in the above semantic truncation detection method, obtaining the first corpus data and obtaining multiple semantic truncation types according to the first corpus data in step S120 includes but is not limited to steps S510 to S530.
Step S510: obtain pre-labeled first corpus data.
Step S520: perform preprocessing and word segmentation on the first corpus data to obtain second corpus data.
Step S530: obtain multiple semantic truncation types according to preset semantic dimensions and the second corpus data, wherein the preset semantic dimensions include at least one of sentence length, first and last characters, sentence structure, part-of-speech order, and frequency distribution.
Data is an important prerequisite for analysis, and accumulating the original corpus data is the first task. A large amount of historical text data is acquired, and the data in which semantic truncation occurs is labeled; in practical applications, one month's business data is selected and labeled by comparing the speech-to-text recognition results with manual transcriptions, yielding the first corpus data, which is then preprocessed and segmented to obtain the second corpus data. To ensure segmentation accuracy, a word segmentation dictionary continuously optimized on business data is used, which is better suited to business application scenarios. The second corpus data is then statistically analyzed along the preset semantic dimensions to obtain the multiple semantic truncation types; the preset semantic dimensions include at least one of sentence length, first and last characters, sentence structure, part-of-speech order, and frequency distribution.
In practical applications, a comprehensive analysis is carried out along multiple dimensions, such as text length, first and last characters, sentence structure, the part-of-speech order of the sentence, and the frequency distribution of segments after word segmentation, so as to summarize the types of utterances prone to truncation. For example, text data such as "I'd like to ask about" and "I want to check" is fairly common; listening back to the recordings of such sentences shows the user pausing to think after finishing the phrase, with the customer service robot starting to reply just as the user is about to say the next sentence. Similarly, after listening to an announcement, a user may unconsciously say "uh" and stall for a second or two, and the robot has already begun to reply by the time the next word is spoken. Although both are semantic truncations, the types differ: the former is mostly a subjective pause by the user, typically expressed as a subject plus a verb, while the latter mostly appears as modal particles with no other content. In one embodiment, according to the preset semantic dimensions and the second corpus data, three semantic truncation types are obtained, namely the first truncation type, the second truncation type, and the third truncation type, wherein the first indicates the occurrence of modal particles, the second indicates the occurrence of pause or interruption words, and the third indicates the occurrence of colloquial habitual words.
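For illustration, a small sketch of gathering a few of these statistics over a pre-segmented corpus; the feature set and names are simplified stand-ins for the analysis described above, not the patent's tooling:

```python
from collections import Counter

def corpus_statistics(segmented_sentences: list[list[str]]) -> dict:
    """Gather simple corpus statistics along some of the preset semantic
    dimensions: sentence length, first/last tokens, and the frequency
    distribution of segments after word segmentation."""
    lengths = [len(s) for s in segmented_sentences if s]
    first_tokens = Counter(s[0] for s in segmented_sentences if s)
    last_tokens = Counter(s[-1] for s in segmented_sentences if s)
    segment_freq = Counter(tok for s in segmented_sentences for tok in s)
    return {
        "avg_length": sum(lengths) / len(lengths),
        "top_first_tokens": first_tokens.most_common(5),
        "top_last_tokens": last_tokens.most_common(5),
        "top_segments": segment_freq.most_common(10),
    }
```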
As shown in Fig. 7, in the above semantic truncation detection method, the BERT classification model includes a fully connected layer and two Transformer layers; in step S240, inputting the training set into the initial BERT model for training includes but is not limited to steps S610 to S640.
Step S610: input the data in the training set into the Transformer layers of the initial BERT model.
Step S620: input the output vector of the last Transformer layer into the fully connected layer, and output the probability prediction scores of the two categories, wherein the probability prediction scores include a truncation prediction score and a non-truncation prediction score.
Step S630: if the truncation prediction score is higher than or equal to the preset truncation threshold, output a prediction result indicating that semantic truncation occurs.
Step S640: train the initial BERT model according to the training set and the prediction results.
The embodiment of the present application modifies some of the Transformer units in the middle layers of the BERT classification model, reducing the twelve-layer Transformer structure of the initial BERT model to a two-layer Transformer structure. This greatly simplifies the model without substantially affecting its performance, and correspondingly greatly reduces the number of model parameters; in tests of the model, the training speed of the entire model increased threefold. Streamlining the model structure greatly improves both the training speed and the prediction speed of the model, which helps meet enterprises' higher demands for rapid iteration of and responsiveness from business models.
The detailed process of inputting the training set into the initial BERT model for training is as follows: the preprocessed data in the training set is input into the initial BERT model; the data passes through the Embedding layer to obtain the text representation and is then fed into the Transformer layers; the output vector of the hidden state of the last Transformer layer is input into the fully connected layer, whose output is the probability prediction scores of the two categories, that is, the truncation prediction score and the non-truncation prediction score. A threshold judgment mechanism with a preset truncation threshold is designed, and the prediction result is output by comparing the truncation prediction score with the preset truncation threshold: if the truncation prediction score is higher than or equal to the preset truncation threshold, the prediction result indicates that semantic truncation occurs. The initial BERT model is trained according to the training set and the prediction results, yielding a BERT classification model with good recognition performance.
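A sketch of the slimmed-down classifier described above, assuming the HuggingFace Transformers library purely for illustration (the patent names no framework): a BERT encoder reduced to two Transformer layers with a two-class fully connected head on top.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(
    hidden_size=768,         # same width as BERT-base
    num_attention_heads=12,  # same head count as BERT-base
    num_hidden_layers=2,     # twelve Transformer layers reduced to two
    num_labels=2,            # truncation vs. non-truncation categories
)
model = BertForSequenceClassification(config)  # randomly initialized sketch

# Dummy batch: in practice these would be tokenized sentence pairs from the
# training set built in steps S210 to S240.
input_ids = torch.randint(0, config.vocab_size, (4, 32))
logits = model(input_ids=input_ids).logits     # shape (4, 2)
scores = torch.softmax(logits, dim=-1)         # probability prediction scores
is_truncated = scores[:, 1] >= 0.6             # class index 1 taken as
                                               # "truncation" by convention here
```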
需要说明的是,本实施例中BERT分类模型在训练过程中的预设截断阈值与上述在检测过程中的预设截断阈值为相同的数值,根据多次测试结果,可将预设截断阈值设定为0.6。
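A minimal sketch of such a slimmed-down classifier using the Hugging Face transformers library follows; the patent does not specify an implementation, so the pooling choice, class order, and the use of a randomly initialized (rather than pretrained and truncated) encoder here are all assumptions:

    import torch
    from transformers import BertConfig, BertModel

    config = BertConfig(num_hidden_layers=2)             # two Transformer layers instead of twelve
    encoder = BertModel(config)                          # in practice a pretrained, truncated encoder
    classifier = torch.nn.Linear(config.hidden_size, 2)  # fully connected layer for the two classes

    def predict_scores(input_ids, attention_mask):
        """Return (truncation_score, non_truncation_score) per sample."""
        out = encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]            # last Transformer layer's [CLS] hidden state
        probs = torch.softmax(classifier(cls_vec), dim=-1)
        return probs[:, 0], probs[:, 1]                  # class order is an assumption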
The above semantic truncation detection method further includes:
if the detection result indicates that semantic truncation occurs in the text data to be detected, waiting for a first preset time before executing a response operation;
if the detection result indicates that no semantic truncation occurs in the text data to be detected, executing the response operation directly.
In practical applications, after the user's speech data is converted into text data, whether semantic truncation occurs in the text data to be detected is judged. If the detection result indicates that semantic truncation occurs, the response operation is executed after waiting for the first preset time, for example extending the customer-service robot's waiting time by 300 milliseconds; this allows the user's intention to be identified more accurately, reduces the extra interaction rounds caused by recognition failures, and provides the user with a more considerate interactive service. If the detection result indicates that no semantic truncation occurs, the response operation is executed directly and the customer-service robot answers in the normal flow, which speeds up the service response and helps improve user satisfaction.
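A sketch of how the detection result could gate the robot's reply; the 300 ms figure comes from the example above, and respond stands in for whatever downstream answer routine the dialogue system uses:

    import time

    FIRST_PRESET_WAIT = 0.3  # seconds; e.g. the 300 ms extension mentioned above

    def handle_utterance(text, is_truncated, respond):
        """Delay the response when truncation is detected; otherwise
        answer immediately in the normal flow."""
        if is_truncated:
            time.sleep(FIRST_PRESET_WAIT)  # give the user time to finish the thought
        respond(text)

A production system would of course keep listening for further speech during the wait rather than sleep unconditionally; the sketch only shows where the first preset time enters the flow.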
The embodiments of the present application may obtain and process the relevant data based on artificial intelligence technology. Artificial intelligence is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. The semantic truncation detection method of the embodiments of the present application can be applied in an intelligent customer-service system: during human-machine dialogue, the user's voice response is usually converted into response text, which is input into the dialogue system for recognition, and the customer-service robot provides the user with voice interaction services such as after-sales consultation and operation guidance. The method can also be applied in other fields where customer-service robots can replace manual voice services, such as voice services in education and healthcare.
Based on the above semantic truncation detection method, embodiments of the semantic truncation detection apparatus, computer device, and computer-readable storage medium of the present application are presented below.
As shown in FIG. 8, FIG. 8 is a schematic structural diagram of a semantic truncation detection apparatus provided by an embodiment of the present application. The semantic truncation detection apparatus 800 of this embodiment includes, but is not limited to, a first obtaining module 810, a second obtaining module 820, a judging module 830, a detection module 840, a third obtaining module 850, a positive-example construction module 860, a negative-example construction module 870, and a training module 880.
Specifically, the first obtaining module 810 is configured to obtain text data to be detected; the second obtaining module 820 is configured to obtain first corpus data and derive a plurality of semantic truncation types from the first corpus data, where the first corpus data is historical text data in which semantic truncation occurred; the judging module 830 is configured to judge the semantic truncation type to which the text data to be detected belongs; the detection module 840 is configured to detect, according to the semantic truncation type, the text data to be detected through the preset rule and/or the pre-trained BERT classification model, obtaining a detection result of whether semantic truncation occurs in the text data to be detected; the third obtaining module 850 is configured to obtain business corpus data, where the business corpus data includes a plurality of pieces of business text data; the positive-example construction module 860 is configured to select one random position in each piece of business text data for splitting, constructing positive sentence pairs, where a positive sentence pair is a pair of preceding and following sentences having a truncation relationship; the negative-example construction module 870 is configured to select any two pieces of business text data, constructing negative sentence pairs, where a negative sentence pair is a pair of preceding and following sentences having a non-truncation relationship; the training module 880 is configured to construct a training set from the positive sentence pairs and the negative sentence pairs, and input the training set into the initial BERT model for training to obtain the BERT classification model. A sketch of the pair construction follows.
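A minimal sketch of the positive/negative pair construction, assuming texts is the list of business text utterances; the labels and split policy shown are illustrative:

    import random

    def build_pairs(texts):
        """Construct positive pairs (one random split of a single
        utterance -> truncation relationship, label 1) and negative
        pairs (two unrelated utterances -> non-truncation, label 0)."""
        positives, negatives = [], []
        for t in texts:
            if len(t) > 1:
                cut = random.randrange(1, len(t))       # one random split position
                positives.append((t[:cut], t[cut:], 1))
        for _ in range(len(positives)):
            a, b = random.sample(texts, 2)              # any two distinct utterances
            negatives.append((a, b, 0))
        return positives + negatives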
According to the technical solutions of the embodiments of the present application, text data to be detected is obtained, the semantic truncation type to which it belongs is judged, and the text data is detected according to the preset rule and/or the pre-trained BERT classification model. By choosing a different detection approach for each semantic truncation type, interactive services can be provided to users in a more targeted way, which helps improve responsiveness during interaction. In addition, a pre-training task targeting the characteristics of text truncation is designed for training the initial BERT model: preceding and following sentences in a truncation relationship constructed from business text data serve as positive sentence pairs, and preceding and following sentences in a non-truncation relationship serve as negative sentence pairs; training the model on the training set built from these pairs lets the model better learn the truncation features and improves its recognition performance. As a result, when facing various complex real interaction situations, the customer-service robot can identify the user's intention more accurately, reduce the extra interaction rounds between user and robot caused by recognition failures, effectively improve service quality, and raise user satisfaction.
In the above semantic truncation detection apparatus, the plurality of semantic truncation types include a first truncation type, a second truncation type, and a third truncation type, the preset rule includes a first matching dictionary, a second matching dictionary, and a third matching dictionary, and the detection module's detecting the text data to be detected according to the semantic truncation type through the preset rule and/or the pre-trained BERT classification model includes at least one of the following:
if the text data to be detected belongs to the first truncation type, matching the text data to be detected according to the first matching dictionary, where the first truncation type indicates the occurrence of a modal particle;
if the text data to be detected belongs to the second truncation type, detecting the text data to be detected according to the second matching dictionary and the BERT classification model, where the second truncation type indicates the occurrence of a pause or interruption word;
if the text data to be detected belongs to the third truncation type, detecting the text data to be detected according to the third matching dictionary and the BERT classification model, where the third truncation type indicates the occurrence of a spoken-habit word.
In the above semantic truncation detection apparatus, the first matching dictionary prestores a plurality of modal particles; the detection module's obtaining the detection result of whether semantic truncation occurs in the text data to be detected specifically includes:
if the text data to be detected matches a modal particle in the first matching dictionary, obtaining a detection result that semantic truncation occurs in the text data to be detected.
In the above semantic truncation detection apparatus, the second matching dictionary prestores a plurality of pause words and interruption words; the detection module's detecting the text data to be detected according to the second matching dictionary and the BERT classification model specifically includes:
matching the beginning and the end of the text data to be detected against the second matching dictionary;
if the text data to be detected cannot match any word in the second matching dictionary, performing detection through the BERT classification model and outputting probability prediction scores, where the probability prediction scores include a truncation prediction score and a non-truncation prediction score.
The detection module's obtaining the detection result of whether semantic truncation occurs in the text data to be detected specifically includes:
if the truncation prediction score is higher than or equal to the preset truncation threshold, obtaining a detection result that semantic truncation occurs in the text data to be detected.
In the above semantic truncation detection apparatus, the third matching dictionary prestores a plurality of spoken-habit words; the detection module's detecting the text data to be detected according to the third matching dictionary and the BERT classification model specifically includes:
matching the end of the text data to be detected against the third matching dictionary;
if the text data to be detected cannot match any word in the third matching dictionary, performing detection through the BERT classification model and outputting probability prediction scores, where the probability prediction scores include a truncation prediction score and a non-truncation prediction score.
The detection module's obtaining the detection result of whether semantic truncation occurs in the text data to be detected specifically includes:
if the truncation prediction score is higher than or equal to the preset truncation threshold, obtaining a detection result that semantic truncation occurs in the text data to be detected.
In the above semantic truncation detection apparatus, the second obtaining module is specifically configured to:
obtain pre-annotated first corpus data;
preprocess and word-segment the first corpus data to obtain second corpus data;
derive a plurality of semantic truncation types from the preset semantic dimensions and the second corpus data, where the preset semantic dimensions include at least one of sentence length, first and last characters, sentence structure, part-of-speech order, and frequency distribution.
In the above semantic truncation detection apparatus, the BERT classification model includes a fully connected layer and two Transformer layers; the training module's inputting the training set into the initial BERT model for training specifically includes:
inputting the data in the training set into the Transformer layers of the initial BERT model;
inputting the output vector of the last Transformer layer into the fully connected layer, and outputting probability prediction scores for two classes, where the probability prediction scores include a truncation prediction score and a non-truncation prediction score;
if the truncation prediction score is higher than or equal to the preset truncation threshold, outputting a prediction result indicating that semantic truncation occurs;
training the initial BERT model according to the training set and the prediction results.
The above semantic truncation detection apparatus further includes a first execution module and a second execution module. The first execution module is configured to wait for the first preset time before executing a response operation when the detection result indicates that semantic truncation occurs in the text data to be detected; the second execution module is configured to execute the response operation directly when the detection result indicates that no semantic truncation occurs in the text data to be detected.
It should be noted that, for the specific implementations and corresponding technical effects of the semantic truncation detection apparatus of the embodiments of the present application, reference may be made to the specific implementations and corresponding technical effects of the semantic truncation detection method described above.
As shown in FIG. 9, an embodiment of the present application further provides a computer device 900. The computer device 900 includes a memory 910, a processor 920, and a computer program stored on the memory 910 and executable on the processor 920.
The processor 920 and the memory 910 may be connected by a bus or in other ways. As a non-transitory computer-readable storage medium, the memory 910 can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory 910 may include high-speed random-access memory and may also include non-transitory memory, such as at least one magnetic-disk storage device, flash-memory device, or other non-transitory solid-state storage device. In some implementations, the memory 910 may optionally include memory remotely located relative to the processor 920, and such remote memories may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof. Those skilled in the art will understand that the computer device 900 shown in FIG. 9 does not limit the embodiments of the present application and may include more or fewer components than shown, combine certain components, or have a different component arrangement. The non-transitory software programs and instructions required to implement the semantic truncation detection method of the above embodiments are stored in the memory 910 and, when executed by the processor 920, perform a semantic truncation detection method, where the semantic truncation detection method includes: obtaining text data to be detected; obtaining first corpus data and deriving a plurality of semantic truncation types from the first corpus data, where the first corpus data is historical text data in which semantic truncation occurred; judging the semantic truncation type to which the text data to be detected belongs; detecting the text data to be detected according to the semantic truncation type through a preset rule and/or a pre-trained BERT classification model, obtaining a detection result of whether semantic truncation occurs in the text data to be detected; where the BERT classification model is obtained through the following training steps: obtaining business corpus data, where the business corpus data includes a plurality of pieces of business text data; selecting one random position in each piece of business text data for splitting to construct positive sentence pairs, where a positive sentence pair is a pair of preceding and following sentences having a truncation relationship; selecting any two pieces of business text data to construct negative sentence pairs, where a negative sentence pair is a pair of preceding and following sentences having a non-truncation relationship; and constructing a training set from the positive sentence pairs and the negative sentence pairs, and inputting the training set into an initial BERT model for training to obtain the BERT classification model.
In addition, an embodiment of the present application further provides a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores computer-executable instructions for executing the above semantic truncation detection method. For example, when executed by a processor of the above semantic truncation detection apparatus, the instructions may cause the processor to perform a semantic truncation detection method, where the semantic truncation detection method includes: obtaining text data to be detected; obtaining first corpus data and deriving a plurality of semantic truncation types from the first corpus data, where the first corpus data is historical text data in which semantic truncation occurred; judging the semantic truncation type to which the text data to be detected belongs; detecting the text data to be detected according to the semantic truncation type through a preset rule and/or a pre-trained BERT classification model, obtaining a detection result of whether semantic truncation occurs in the text data to be detected; where the BERT classification model is obtained through the following training steps: obtaining business corpus data, where the business corpus data includes a plurality of pieces of business text data; selecting one random position in each piece of business text data for splitting to construct positive sentence pairs, where a positive sentence pair is a pair of preceding and following sentences having a truncation relationship; selecting any two pieces of business text data to construct negative sentence pairs, where a negative sentence pair is a pair of preceding and following sentences having a non-truncation relationship; and constructing a training set from the positive sentence pairs and the negative sentence pairs, and inputting the training set into an initial BERT model for training to obtain the BERT classification model.
Those of ordinary skill in the art will understand that all or some of the steps and systems of the methods disclosed above can be implemented as software, firmware, hardware, or an appropriate combination thereof. Some or all of the physical components may be implemented as software executed by a processor such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic-disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media usually embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
The preferred implementations of the present application have been described in detail above, but the present application is not limited to the above embodiments. Those familiar with the art may make various equivalent modifications or substitutions without departing from the spirit of the present application, and such equivalent modifications or substitutions are all included within the scope defined by the claims of the present application.

Claims (20)

  1. A semantic truncation detection method, comprising:
    obtaining text data to be detected;
    obtaining first corpus data, and deriving a plurality of semantic truncation types from the first corpus data, wherein the first corpus data is historical text data in which semantic truncation occurred;
    judging the semantic truncation type to which the text data to be detected belongs; and
    detecting, according to the semantic truncation type, the text data to be detected through a preset rule and/or a pre-trained BERT classification model, to obtain a detection result of whether semantic truncation occurs in the text data to be detected;
    wherein the BERT classification model is obtained through the following training steps:
    obtaining business corpus data, wherein the business corpus data comprises a plurality of pieces of business text data;
    selecting one random position in each piece of the business text data for splitting, to construct positive sentence pairs, wherein a positive sentence pair is a pair of preceding and following sentences having a truncation relationship;
    selecting any two pieces of the business text data, to construct negative sentence pairs, wherein a negative sentence pair is a pair of preceding and following sentences having a non-truncation relationship; and
    constructing a training set from the positive sentence pairs and the negative sentence pairs, and inputting the training set into an initial BERT model for training, to obtain the BERT classification model.
  2. The semantic truncation detection method according to claim 1, wherein the plurality of semantic truncation types comprise a first truncation type, a second truncation type, and a third truncation type, the preset rule comprises a first matching dictionary, a second matching dictionary, and a third matching dictionary, and the detecting, according to the semantic truncation type, the text data to be detected through the preset rule and/or the pre-trained BERT classification model comprises at least one of the following:
    if the text data to be detected belongs to the first truncation type, matching the text data to be detected according to the first matching dictionary, wherein the first truncation type indicates the occurrence of a modal particle;
    if the text data to be detected belongs to the second truncation type, detecting the text data to be detected according to the second matching dictionary and the BERT classification model, wherein the second truncation type indicates the occurrence of a pause or interruption word; and
    if the text data to be detected belongs to the third truncation type, detecting the text data to be detected according to the third matching dictionary and the BERT classification model, wherein the third truncation type indicates the occurrence of a spoken-habit word.
  3. The semantic truncation detection method according to claim 2, wherein the first matching dictionary prestores a plurality of modal particles, and the obtaining the detection result of whether semantic truncation occurs in the text data to be detected comprises:
    if the text data to be detected matches a modal particle in the first matching dictionary, obtaining a detection result that semantic truncation occurs in the text data to be detected.
  4. The semantic truncation detection method according to claim 2, wherein the second matching dictionary prestores a plurality of pause words and interruption words, and the detecting the text data to be detected according to the second matching dictionary and the BERT classification model comprises:
    matching the beginning and the end of the text data to be detected against the second matching dictionary; and
    if the text data to be detected cannot match any word in the second matching dictionary, performing detection through the BERT classification model and outputting probability prediction scores, wherein the probability prediction scores comprise a truncation prediction score and a non-truncation prediction score;
    the obtaining the detection result of whether semantic truncation occurs in the text data to be detected comprises:
    if the truncation prediction score is higher than or equal to a preset truncation threshold, obtaining a detection result that semantic truncation occurs in the text data to be detected.
  5. The semantic truncation detection method according to claim 2, wherein the third matching dictionary prestores a plurality of spoken-habit words, and the detecting the text data to be detected according to the third matching dictionary and the BERT classification model comprises:
    matching the end of the text data to be detected against the third matching dictionary; and
    if the text data to be detected cannot match any word in the third matching dictionary, performing detection through the BERT classification model and outputting probability prediction scores, wherein the probability prediction scores comprise a truncation prediction score and a non-truncation prediction score;
    the obtaining the detection result of whether semantic truncation occurs in the text data to be detected comprises:
    if the truncation prediction score is higher than or equal to a preset truncation threshold, obtaining a detection result that semantic truncation occurs in the text data to be detected.
  6. The semantic truncation detection method according to claim 1 or 2, wherein the obtaining first corpus data and deriving a plurality of semantic truncation types from the first corpus data comprises:
    obtaining pre-annotated first corpus data;
    preprocessing and word-segmenting the first corpus data to obtain second corpus data; and
    deriving a plurality of semantic truncation types from preset semantic dimensions and the second corpus data, wherein the preset semantic dimensions comprise at least one of sentence length, first and last characters, sentence structure, part-of-speech order, and frequency distribution.
  7. The semantic truncation detection method according to claim 1, wherein the BERT classification model comprises a fully connected layer and two Transformer layers, and the inputting the training set into the initial BERT model for training comprises:
    inputting the data in the training set into the Transformer layers of the initial BERT model;
    inputting the output vector of the last Transformer layer into the fully connected layer, and outputting probability prediction scores for two classes, wherein the probability prediction scores comprise a truncation prediction score and a non-truncation prediction score;
    if the truncation prediction score is higher than or equal to a preset truncation threshold, outputting a prediction result indicating that semantic truncation occurs; and
    training the initial BERT model according to the training set and the prediction results.
  8. A semantic truncation detection apparatus, comprising:
    a first obtaining module, configured to obtain text data to be detected;
    a second obtaining module, configured to obtain first corpus data and derive a plurality of semantic truncation types from the first corpus data, wherein the first corpus data is historical text data in which semantic truncation occurred;
    a judging module, configured to judge the semantic truncation type to which the text data to be detected belongs;
    a detection module, configured to detect, according to the semantic truncation type, the text data to be detected through a preset rule and/or a pre-trained BERT classification model, to obtain a detection result of whether semantic truncation occurs in the text data to be detected;
    a third obtaining module, configured to obtain business corpus data, wherein the business corpus data comprises a plurality of pieces of business text data;
    a positive-example construction module, configured to select one random position in each piece of the business text data for splitting, to construct positive sentence pairs, wherein a positive sentence pair is a pair of preceding and following sentences having a truncation relationship;
    a negative-example construction module, configured to select any two pieces of the business text data, to construct negative sentence pairs, wherein a negative sentence pair is a pair of preceding and following sentences having a non-truncation relationship; and
    a training module, configured to construct a training set from the positive sentence pairs and the negative sentence pairs, and input the training set into an initial BERT model for training, to obtain the BERT classification model.
  9. A computer device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein when executing the computer program, the processor implements a semantic truncation detection method, and the semantic truncation detection method comprises:
    obtaining text data to be detected;
    obtaining first corpus data, and deriving a plurality of semantic truncation types from the first corpus data, wherein the first corpus data is historical text data in which semantic truncation occurred;
    judging the semantic truncation type to which the text data to be detected belongs; and
    detecting, according to the semantic truncation type, the text data to be detected through a preset rule and/or a pre-trained BERT classification model, to obtain a detection result of whether semantic truncation occurs in the text data to be detected;
    wherein the BERT classification model is obtained through the following training steps:
    obtaining business corpus data, wherein the business corpus data comprises a plurality of pieces of business text data;
    selecting one random position in each piece of the business text data for splitting, to construct positive sentence pairs, wherein a positive sentence pair is a pair of preceding and following sentences having a truncation relationship;
    selecting any two pieces of the business text data, to construct negative sentence pairs, wherein a negative sentence pair is a pair of preceding and following sentences having a non-truncation relationship; and
    constructing a training set from the positive sentence pairs and the negative sentence pairs, and inputting the training set into an initial BERT model for training, to obtain the BERT classification model.
  10. The computer device according to claim 9, wherein the plurality of semantic truncation types comprise a first truncation type, a second truncation type, and a third truncation type, the preset rule comprises a first matching dictionary, a second matching dictionary, and a third matching dictionary, and the detecting, according to the semantic truncation type, the text data to be detected through the preset rule and/or the pre-trained BERT classification model comprises at least one of the following:
    if the text data to be detected belongs to the first truncation type, matching the text data to be detected according to the first matching dictionary, wherein the first truncation type indicates the occurrence of a modal particle;
    if the text data to be detected belongs to the second truncation type, detecting the text data to be detected according to the second matching dictionary and the BERT classification model, wherein the second truncation type indicates the occurrence of a pause or interruption word; and
    if the text data to be detected belongs to the third truncation type, detecting the text data to be detected according to the third matching dictionary and the BERT classification model, wherein the third truncation type indicates the occurrence of a spoken-habit word.
  11. The computer device according to claim 10, wherein the first matching dictionary prestores a plurality of modal particles, and the obtaining the detection result of whether semantic truncation occurs in the text data to be detected comprises:
    if the text data to be detected matches a modal particle in the first matching dictionary, obtaining a detection result that semantic truncation occurs in the text data to be detected.
  12. The computer device according to claim 10, wherein the second matching dictionary prestores a plurality of pause words and interruption words, and the detecting the text data to be detected according to the second matching dictionary and the BERT classification model comprises:
    matching the beginning and the end of the text data to be detected against the second matching dictionary; and
    if the text data to be detected cannot match any word in the second matching dictionary, performing detection through the BERT classification model and outputting probability prediction scores, wherein the probability prediction scores comprise a truncation prediction score and a non-truncation prediction score;
    the obtaining the detection result of whether semantic truncation occurs in the text data to be detected comprises:
    if the truncation prediction score is higher than or equal to a preset truncation threshold, obtaining a detection result that semantic truncation occurs in the text data to be detected.
  13. The computer device according to claim 10, wherein the third matching dictionary prestores a plurality of spoken-habit words, and the detecting the text data to be detected according to the third matching dictionary and the BERT classification model comprises:
    matching the end of the text data to be detected against the third matching dictionary; and
    if the text data to be detected cannot match any word in the third matching dictionary, performing detection through the BERT classification model and outputting probability prediction scores, wherein the probability prediction scores comprise a truncation prediction score and a non-truncation prediction score;
    the obtaining the detection result of whether semantic truncation occurs in the text data to be detected comprises:
    if the truncation prediction score is higher than or equal to a preset truncation threshold, obtaining a detection result that semantic truncation occurs in the text data to be detected.
  14. The computer device according to claim 9 or 10, wherein the obtaining first corpus data and deriving a plurality of semantic truncation types from the first corpus data comprises:
    obtaining pre-annotated first corpus data;
    preprocessing and word-segmenting the first corpus data to obtain second corpus data; and
    deriving a plurality of semantic truncation types from preset semantic dimensions and the second corpus data, wherein the preset semantic dimensions comprise at least one of sentence length, first and last characters, sentence structure, part-of-speech order, and frequency distribution.
  15. A computer-readable storage medium storing computer-executable instructions, wherein the computer-executable instructions are used to execute a semantic truncation detection method, and the semantic truncation detection method comprises:
    obtaining text data to be detected;
    obtaining first corpus data, and deriving a plurality of semantic truncation types from the first corpus data, wherein the first corpus data is historical text data in which semantic truncation occurred;
    judging the semantic truncation type to which the text data to be detected belongs; and
    detecting, according to the semantic truncation type, the text data to be detected through a preset rule and/or a pre-trained BERT classification model, to obtain a detection result of whether semantic truncation occurs in the text data to be detected;
    wherein the BERT classification model is obtained through the following training steps:
    obtaining business corpus data, wherein the business corpus data comprises a plurality of pieces of business text data;
    selecting one random position in each piece of the business text data for splitting, to construct positive sentence pairs, wherein a positive sentence pair is a pair of preceding and following sentences having a truncation relationship;
    selecting any two pieces of the business text data, to construct negative sentence pairs, wherein a negative sentence pair is a pair of preceding and following sentences having a non-truncation relationship; and
    constructing a training set from the positive sentence pairs and the negative sentence pairs, and inputting the training set into an initial BERT model for training, to obtain the BERT classification model.
  16. The computer-readable storage medium according to claim 15, wherein the plurality of semantic truncation types comprise a first truncation type, a second truncation type, and a third truncation type, the preset rule comprises a first matching dictionary, a second matching dictionary, and a third matching dictionary, and the detecting, according to the semantic truncation type, the text data to be detected through the preset rule and/or the pre-trained BERT classification model comprises at least one of the following:
    if the text data to be detected belongs to the first truncation type, matching the text data to be detected according to the first matching dictionary, wherein the first truncation type indicates the occurrence of a modal particle;
    if the text data to be detected belongs to the second truncation type, detecting the text data to be detected according to the second matching dictionary and the BERT classification model, wherein the second truncation type indicates the occurrence of a pause or interruption word; and
    if the text data to be detected belongs to the third truncation type, detecting the text data to be detected according to the third matching dictionary and the BERT classification model, wherein the third truncation type indicates the occurrence of a spoken-habit word.
  17. The computer-readable storage medium according to claim 16, wherein the first matching dictionary prestores a plurality of modal particles, and the obtaining the detection result of whether semantic truncation occurs in the text data to be detected comprises:
    if the text data to be detected matches a modal particle in the first matching dictionary, obtaining a detection result that semantic truncation occurs in the text data to be detected.
  18. The computer-readable storage medium according to claim 16, wherein the second matching dictionary prestores a plurality of pause words and interruption words, and the detecting the text data to be detected according to the second matching dictionary and the BERT classification model comprises:
    matching the beginning and the end of the text data to be detected against the second matching dictionary; and
    if the text data to be detected cannot match any word in the second matching dictionary, performing detection through the BERT classification model and outputting probability prediction scores, wherein the probability prediction scores comprise a truncation prediction score and a non-truncation prediction score;
    the obtaining the detection result of whether semantic truncation occurs in the text data to be detected comprises:
    if the truncation prediction score is higher than or equal to a preset truncation threshold, obtaining a detection result that semantic truncation occurs in the text data to be detected.
  19. The computer-readable storage medium according to claim 16, wherein the third matching dictionary prestores a plurality of spoken-habit words, and the detecting the text data to be detected according to the third matching dictionary and the BERT classification model comprises:
    matching the end of the text data to be detected against the third matching dictionary; and
    if the text data to be detected cannot match any word in the third matching dictionary, performing detection through the BERT classification model and outputting probability prediction scores, wherein the probability prediction scores comprise a truncation prediction score and a non-truncation prediction score;
    the obtaining the detection result of whether semantic truncation occurs in the text data to be detected comprises:
    if the truncation prediction score is higher than or equal to a preset truncation threshold, obtaining a detection result that semantic truncation occurs in the text data to be detected.
  20. The computer-readable storage medium according to claim 15 or 16, wherein the obtaining first corpus data and deriving a plurality of semantic truncation types from the first corpus data comprises:
    obtaining pre-annotated first corpus data;
    preprocessing and word-segmenting the first corpus data to obtain second corpus data; and
    deriving a plurality of semantic truncation types from preset semantic dimensions and the second corpus data, wherein the preset semantic dimensions comprise at least one of sentence length, first and last characters, sentence structure, part-of-speech order, and frequency distribution.
PCT/CN2022/090745 2022-01-18 2022-04-29 Semantic truncation detection method, apparatus, device and computer-readable storage medium WO2023137920A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210057008.6 2022-01-18
CN202210057008.6A CN114372476B (zh) 2022-01-18 2022-01-18 Semantic truncation detection method, apparatus, device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2023137920A1 true WO2023137920A1 (zh) 2023-07-27

Family

ID=81143981

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090745 WO2023137920A1 (zh) 2022-01-18 2022-04-29 Semantic truncation detection method, apparatus, device and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN114372476B (zh)
WO (1) WO2023137920A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114372476B (zh) * 2022-01-18 2023-09-12 Ping An Technology (Shenzhen) Co., Ltd. Semantic truncation detection method, apparatus, device and computer-readable storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5214985B2 (ja) * 2008-01-22 2013-06-19 Nippon Telegraph And Telephone Corporation Text segmentation apparatus, method, program, and computer-readable recording medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950269A (zh) * 2020-08-21 2020-11-17 Tsinghua University Text sentence processing method and apparatus, computer device and storage medium
CN112199499A (zh) * 2020-09-29 2021-01-08 BOE Technology Group Co., Ltd. Text division method, text classification method, apparatus, device and storage medium
CN112256849A (zh) * 2020-10-20 2021-01-22 Shenzhen Qianhai WeBank Co., Ltd. Model training method, text detection method, apparatus, device and storage medium
CN113657094A (zh) * 2021-08-17 2021-11-16 Shenzhen Kewei Robot Technology Co., Ltd. Semantic interaction intention analysis method and apparatus, computer device and storage medium
CN113935331A (zh) * 2021-10-22 2022-01-14 Ping An Technology (Shenzhen) Co., Ltd. Abnormal semantic truncation detection method, apparatus, device and medium
CN114372476A (zh) * 2022-01-18 2022-04-19 Ping An Technology (Shenzhen) Co., Ltd. Semantic truncation detection method, apparatus, device and computer-readable storage medium

Also Published As

Publication number Publication date
CN114372476B (zh) 2023-09-12
CN114372476A (zh) 2022-04-19

Similar Documents

Publication Publication Date Title
CN108509619B (zh) Voice interaction method and device
US11503155B2 (en) Interactive voice-control method and apparatus, device and medium
CN107818798B (zh) Customer service quality evaluation method, apparatus, device and storage medium
CN111651996B (zh) Abstract generation method and apparatus, electronic device and storage medium
CN112100349A (zh) Multi-round dialogue method and apparatus, electronic device and storage medium
CN108710704B (zh) Dialogue state determination method and apparatus, electronic device and storage medium
KR20170047268A (ko) Orphan utterance detection system and method
US11574637B1 (en) Spoken language understanding models
CN106407393B (zh) Information processing method and apparatus for intelligent device
WO2023108994A1 (zh) Sentence generation method, electronic device and storage medium
CN108538294B (zh) Voice interaction method and apparatus
CN109119070A (zh) Voice endpoint detection method, apparatus, device and storage medium
WO2021063101A1 (zh) Artificial intelligence-based speech breakpoint detection method, apparatus and device
CN113408287B (zh) Entity recognition method and apparatus, electronic device and storage medium
WO2023065633A1 (zh) Abnormal semantic truncation detection method, apparatus, device and medium
WO2023137920A1 (zh) Semantic truncation detection method, apparatus, device and computer-readable storage medium
CN110956958A (zh) Search method and apparatus, terminal device and storage medium
CN113763962A (zh) Audio processing method and apparatus, storage medium and computer device
WO2024055603A1 (zh) Method and apparatus for identifying text written by minors
CN112528628A (zh) Text processing method and apparatus, and electronic device
KR20210123545A (ko) Method and apparatus for providing dialogue service based on user feedback
CN110809796B (zh) Speech recognition system and method with decoupled wake phrases
CN112687296B (zh) Audio disfluency recognition method, apparatus, device and readable storage medium
KR20230116143A (ko) Consultation type classification system
CN114358019A (zh) Intention prediction model training method and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22921351

Country of ref document: EP

Kind code of ref document: A1