CN112749544A - Training method and system for paragraph segmentation model - Google Patents

Training method and system for paragraph segmentation model

Info

Publication number
CN112749544A
Authority
CN
China
Prior art keywords
model
data
segmentation
field
punctuation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011583136.1A
Other languages
Chinese (zh)
Other versions
CN112749544B (en)
Inventor
Qin Wenjie (秦文杰)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AI Speech Ltd filed Critical AI Speech Ltd
Priority to CN202011583136.1A priority Critical patent/CN112749544B/en
Publication of CN112749544A publication Critical patent/CN112749544A/en
Application granted granted Critical
Publication of CN112749544B publication Critical patent/CN112749544B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F40/205 Parsing (G06F40/20 Natural language analysis; G06F40/00 Handling natural language data; G06F Electric digital data processing; G06 Computing; G Physics)
    • G06N3/04 Architecture, e.g. interconnection topology (G06N3/02 Neural networks; G06N3/00 Computing arrangements based on biological models; G06N Computing arrangements based on specific computational models)
    • G06N3/08 Learning methods (G06N3/02 Neural networks; G06N3/00 Computing arrangements based on biological models; G06N Computing arrangements based on specific computational models)

Abstract

The embodiment of the invention provides a method for training a paragraph segmentation model. The method comprises the following steps: pre-training a neural network model of the paragraph segmentation model with general segmentation data; and training the coding layer related to feature extraction in the pre-trained paragraph segmentation model on domain segmentation data to obtain a domain-adapted paragraph segmentation model. The embodiment of the invention also provides a training system for the paragraph segmentation model. To address the problem that training for a specific domain requires a large amount of precisely labeled data, the model is first trained on a large amount of easily obtained general segmentation data and then fine-tuned on a small amount of precisely labeled domain data, which effectively reduces the cost of domain adaptation. To address the model's sensitivity to the output of the upstream punctuation model, the robustness of the segmentation model is improved, its dependence on the upstream punctuation is reduced, and the output of the upstream punctuation model can even be corrected.

Description

Training method and system for paragraph segmentation model
Technical Field
The invention relates to the field of intelligent voice, in particular to a method and a system for training a paragraph segmentation model.
Background
Paragraph segmentation is becoming increasingly important. For example, when a recording of a teacher's lecture is converted into text, the result is one large block of characters. Paragraph segmentation splits this block into multiple paragraphs, so that paragraph boundaries are clearly visible when the user reviews the text.
Existing methods on the market include paragraph segmentation based on traditional machine learning, such as the SVM (Support Vector Machine), and paragraph segmentation based on neural networks, such as the LSTM (Long Short-Term Memory) network.
Paragraph segmentation is essentially a classification task: the model must predict, for each sentence in a document, whether a line break should follow it, thereby completing the paragraph segmentation of the text.
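This per-sentence classification can be sketched as follows: score each sentence, then threshold the score into a break/no-break decision. The scoring function below is a trivial stand-in for the SVM or LSTM models discussed here, and every name in it is illustrative.

```python
def predict_breaks(sentences, score_fn, threshold=0.5):
    """Return one break/no-break decision per sentence."""
    return [score_fn(s) > threshold for s in sentences]

# Stand-in scorer: arbitrarily treats long sentences as paragraph enders.
# A real system would replace this with an SVM or LSTM classifier.
toy_score = lambda s: min(len(s) / 40.0, 1.0)

flags = predict_breaks(
    ["short.", "a much longer closing sentence that ends the paragraph."],
    toy_score,
)
```

The rest of the pipeline (assembling paragraphs from these decisions) is independent of which classifier produces the flags.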
The SVM-based paragraph segmentation method mainly learns a hyperplane that separates segmenting sentences from non-segmenting sentences in a high-dimensional space.
The LSTM-based paragraph segmentation method uses the encoder of a deep learning model, represented by the LSTM, to extract text features, and predicts from those features whether each sentence requires a line break.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:
1. The cost of domain adaptation is high. Text with paragraph segmentation information usually consists of well-formed news articles, which are considerable in scale and easy to obtain. However, a model trained on such data segments poorly in a new domain, and a large amount of text in the corresponding domain must be labeled manually for retraining. Because the trained model does not contain any general knowledge of text processing, it can only learn from scratch on a large amount of manually annotated data.
2. Sensitivity to the upstream punctuation output. The upstream punctuation model performs poorly on text in some domains; in particular, its F1 score on sentence-ending punctuation such as the period greatly affects the performance of the downstream segmentation model. In other words, the segmentation model is not robust.
Disclosure of Invention
The embodiments aim to at least solve the problems in the prior art that domain adaptation is costly and that the model is sensitive to the upstream punctuation output.
In a first aspect, an embodiment of the present invention provides a method for training a paragraph segmentation model, including:
pre-training a neural network model of the paragraph segmentation model by using general segmentation data;
and training the coding layer related to feature extraction in the pre-trained paragraph segmentation model on domain segmentation data to obtain a domain-adapted paragraph segmentation model.
In a second aspect, an embodiment of the present invention provides a training system for a paragraph segmentation model, including:
the model pre-training program module is used for pre-training the neural network model of the paragraph segmentation model by utilizing the general segmentation data;
and the segmentation model training program module is used for training the coding layer related to feature extraction in the pre-trained paragraph segmentation model on domain segmentation data to obtain a domain-adapted paragraph segmentation model.
In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for training a paragraph segmentation model according to any of the embodiments of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method for training a paragraph segmentation model according to any embodiment of the present invention.
The embodiment of the invention has the following beneficial effects. To address the problem that training for a specific domain requires a large amount of precisely labeled data, a pre-trained model such as BERT is trained on a large amount of easily obtained general segmentation data and finally fine-tuned on a small amount of precisely labeled domain data, which effectively reduces the cost of domain adaptation. To address the sensitivity to the output of the upstream punctuation model, new segmentation training data is constructed by combining the segmentation information with the upstream punctuation output, and the distribution of punctuation types before the segmentation marks is counted to introduce new sentence-splitting punctuation. This improves the robustness of the segmentation model, reduces its dependence on the upstream punctuation, and makes it possible to correct the upstream punctuation output.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flowchart of a method for training a paragraph segmentation model according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a paragraph segmentation general procedure of a method for training a paragraph segmentation model according to an embodiment of the present invention;
FIG. 3 is a structural data diagram of a method for training a paragraph segmentation model according to an embodiment of the present invention;
FIG. 4 is a data diagram of error correction effects of a segmentation model on punctuation model results of a training method of a paragraph segmentation model according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a training system for a paragraph segmentation model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a method for training a paragraph segmentation model according to an embodiment of the present invention, including the following steps:
s11: pre-training a neural network model of the paragraph segmentation model by using common segmentation data;
s12: and training the coding layer related to feature extraction in the pre-trained paragraph segmentation model based on the field segmentation data to obtain the paragraph segmentation model in the adaptation field.
The coding layer related to feature extraction in the neural network model of the paragraph segmentation model is shared with the coding layer related to feature extraction of the domain-adapted paragraph segmentation model, and is used for learning and extracting lexical, syntactic, and grammatical features.
In this embodiment, existing segmentation models require a large amount of annotated data when adapting to a new domain, mainly because conventional training schemes do not take advantage of the fact that the low-level feature extraction component of Natural Language Processing (NLP) models can be shared across tasks.
For step S11, general corpora are relatively easy to obtain. Taking the Transformer, currently the mainstream neural network in NLP, as an example: the network consists of several layers, where the bottom coding layers generally learn general linguistic knowledge such as lexical, syntactic, and grammatical knowledge for feature extraction, while the upper coding layers learn task-specific knowledge. Therefore, once a Transformer model has been trained on massive data for one task, its bottom coding layers can be reused for other NLP tasks with little data, which reduces training overhead. In this way, the neural network model of the paragraph segmentation model is pre-trained on massive general segmentation data.
As one embodiment, the neural network model includes a BERT model.
The encoder of the Transformer has a self-attention mechanism and supports bidirectional training, so it can obtain semantic representations at the sentence level, above the word level. To adapt to transfer learning across multiple tasks, BERT also designs more general input and output layers. The BERT model was chosen in particular because its fine-tuning cost is low.
For step S12, only a small amount of domain segmentation data is needed to fine-tune the coding layer related to feature extraction (e.g., the bottom coding layer described above) in the paragraph segmentation model trained in step S11, which effectively reduces the cost of domain adaptation.
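As a runnable illustration of this two-round schedule (not the actual BERT training code; the toy object and all names below are stand-ins), the sketch pre-trains every layer on plentiful general data and then fine-tunes the shared bottom feature-extraction layers on a small domain set:

```python
class ToySegmenter:
    """Stand-in for a layered Transformer; updates[i] counts how many
    examples have touched layer i's weights."""
    def __init__(self, n_layers=4):
        self.updates = [0] * n_layers

    def train(self, dataset, trainable_layers):
        for i in trainable_layers:
            self.updates[i] += len(dataset)

model = ToySegmenter()
general_data = list(range(1000))  # large, easily obtained general corpus
domain_data = list(range(20))     # small, precisely labeled domain corpus

# Round 1 (step S11): pre-train all layers on general segmentation data.
model.train(general_data, trainable_layers=[0, 1, 2, 3])
# Round 2 (step S12): fine-tune the bottom feature-extraction layers,
# which are shared with the domain-adapted model, on the small domain set.
model.train(domain_data, trainable_layers=[0, 1])
```

The point of the sketch is the schedule itself: the expensive pass touches all layers once, while domain adaptation only needs the small second pass.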
As this embodiment shows, to address the problem that training for a specific domain requires a large amount of precisely labeled data, a pre-trained model such as BERT is trained on a large amount of easily obtained general segmentation data and finally fine-tuned on a small amount of precisely labeled domain data, which effectively reduces the cost of domain adaptation.
As an implementation manner, in this embodiment, the domain segmentation data is generated from an upstream punctuation model and manually labeled segmentation data, as follows:
inputting raw domain data into the upstream punctuation model to obtain punctuated domain data;
receiving manually labeled segmentation data in which the raw domain data has been annotated by hand;
determining a sentence-ending symbol set based on the punctuation types in the manually labeled segmentation data, and segmenting the raw domain data to obtain manually punctuated and segmented domain data;
and generating domain segmentation data carrying both punctuation information and segmentation information, based on the punctuated domain data and the manually punctuated domain data.
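Under the assumption that the upstream punctuation model can be treated as a callable and that annotators provide break indices (both are mocked below; every name is illustrative), the generation steps above can be sketched as:

```python
import re

def strip_punct(text):
    """Remove existing punctuation from raw domain text."""
    return re.sub(r"[.,!?;]", "", text)

def build_domain_data(raw_sentences, punct_model, manual_breaks):
    """Pair each upstream-punctuated sentence with a break/no-break label.

    raw_sentences : unpunctuated domain sentences
    punct_model   : callable standing in for the upstream punctuation model
    manual_breaks : indices of sentences after which annotators placed
                    a paragraph break
    """
    punctuated = [punct_model(strip_punct(s)) for s in raw_sentences]
    return [(sent, i in manual_breaks) for i, sent in enumerate(punctuated)]

# Mock upstream model: appends a comma or period. Its quality is
# deliberately imperfect, which is exactly the situation the method
# is designed to handle.
mock_punct = lambda s: s + ("," if len(s) % 2 else ".")

data = build_domain_data(
    ["hello world", "second part", "new topic"],
    mock_punct,
    manual_breaks={1},
)
```

The resulting pairs carry both punctuation information (from the upstream model) and segmentation information (from the annotators), matching the training data described in the text.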
As an embodiment, before inputting the raw domain data into the upstream punctuation model, the method further comprises: removing punctuation from the raw domain data.
In this embodiment, the existing segmentation model depends heavily on the upstream punctuation output, mainly because existing techniques assume that the quality of the upstream punctuation output is high; as a result, the division into sentence units depends entirely on upstream sentence-ending punctuation such as the period. However, in practical business scenarios such as spoken dialogue, the output quality of the upstream punctuation model is poor, and the prediction of sentence-ending symbols in particular is inaccurate. Statistics show that in such scenarios, the punctuation model usually predicts the positions of punctuation correctly but the types incorrectly. We explored both completely decoupling and partially decoupling the segmentation model from the punctuation model, and from a practical standpoint we finally selected partial decoupling as the final scheme. The specific flow of partial decoupling is shown in FIG. 2:
First, some domain data is prepared. Whether punctuation removal is needed depends on whether the domain data carries punctuation: if it does not, the data is fed directly into the upstream punctuation model; if it does, the punctuation is removed first. Since the subsequent steps restore punctuation before segmentation, the domain data prepared here does not need punctuation at this step.
After punctuation removal, the domain data is fed into the upstream punctuation model to obtain domain data with the upstream punctuation output; at the same time, manually labeled segmentation data is obtained by performing segmentation annotation by hand.
The manually labeled segmentation data is then combined with the output of the upstream punctuation model: the punctuation types that appear immediately before the manual segmentation marks are counted to form a sentence-ending symbol set, which is used to divide the input text into sentences and to construct in-domain training data carrying both punctuation information and segmentation information.
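The statistic just described (counting which punctuation mark immediately precedes each manual segmentation mark, then keeping the frequent ones as sentence enders) might be sketched as follows; the threshold and data are illustrative:

```python
from collections import Counter

def sentence_end_set(punctuated_text, break_positions, min_count=2):
    """Collect the punctuation marks seen just before manual breaks."""
    counts = Counter(punctuated_text[p - 1] for p in break_positions)
    return {mark for mark, c in counts.items() if c >= min_count}

text = "ab, cd. ef, gh,"
breaks = [3, 7, 11, 15]  # paragraph breaks fall right after offsets 2, 6, 10, 14
enders = sentence_end_set(text, breaks)  # the comma dominates, so it is kept
```

With `min_count=1` the period would be kept as well; the threshold controls which marks are promoted into the sentence-ending symbol set.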
Depending on business needs, one then chooses whether a first round of fine-tuning on the general segmentation corpus, based on the pre-trained model, is required. Generally, if the model being trained is a dedicated service model for a specific domain, training on the general segmentation corpus can be skipped; otherwise the general segmentation corpus is used by default. After pre-training is completed, a second round of fine-tuning based on the pre-trained model is performed with the domain segmentation data, as already described in steps S11 and S12 and not repeated here.
After the paragraph segmentation model is trained, it can receive a large block of text input by a user and segment it into paragraphs. If the last sentence of a predicted paragraph does not end with conventional ending punctuation such as a period or question mark (for example, it ends with a comma), that punctuation is uniformly rewritten to a period. The segmented text therefore better conforms to punctuation conventions before being returned to the user.
As this embodiment shows, to address the sensitivity to the output of the upstream punctuation model, new segmentation training data is constructed by combining the segmentation information with the upstream punctuation output, and the distribution of punctuation types before the segmentation marks is counted to introduce new sentence-splitting punctuation. For example, the statistics show that besides sentence-ending punctuation such as the period, the comma also frequently appears before segmentation marks; we therefore treat the comma as sentence-splitting punctuation, and train and predict on sentences divided according to the new splitting punctuation set. If the model ultimately predicts that segmentation is required at a certain comma, we rewrite the comma to a period and insert a paragraph break. This improves the robustness of the segmentation model, reduces its dependence on the upstream punctuation, and makes it possible to correct the upstream punctuation output.
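A minimal sketch of the comma-to-period correction described above; the sentences and flags are illustrative, and in the real system the break flags would come from the trained segmentation model:

```python
def apply_breaks(sentences, break_flags):
    """Join sentences into paragraphs; if a predicted paragraph ends in a
    comma, rewrite the comma to a period before inserting the break."""
    paragraphs, current = [], []
    for sent, brk in zip(sentences, break_flags):
        current.append(sent)
        if brk:
            if current[-1].endswith(","):  # non-terminal ender at a break
                current[-1] = current[-1][:-1] + "."
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n".join(paragraphs)

text_out = apply_breaks(
    ["First point,", "more detail,", "New topic."],
    [False, True, False],
)
# the first paragraph now ends with "more detail." instead of "more detail,"
```

This is the sense in which the segmentation model can correct the upstream punctuation output: a mispredicted comma at a true sentence end is repaired as a side effect of segmentation.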
The method was tested. Objective evaluation (F1 score):
The paragraph segmentation model trained without the method: 24
The paragraph segmentation model trained with the method: 94
Subjective evaluation (manual score, out of 42):
The paragraph segmentation model trained without the method: 20.3
The paragraph segmentation model trained with the method: 33.3
Conclusion: the segmentation quality is significantly improved, and only a small amount of labeled corpus is needed.
Partial decoupling of the paragraph segmentation model from the punctuation model:
as can be seen from FIG. 3, the segmented model, which is not decoupled from the upstream punctuation model, is greatly affected by the output of the upstream punctuation, and its F1 value at the segment sharply decreases from 88 to 36 when the punctuation changes from an artificial punctuation to a system punctuation.
The segmentation model using the partial decoupling scheme, by contrast, remains stable throughout: its F1 scores at segment boundaries are 92 with manual punctuation and 94 with system punctuation.
And (4) conclusion: the partial decoupling scheme can obviously improve the robustness of the model and can bring the benefit of improving the segment quality.
A further effect is shown in FIG. 4, which evaluates the error-correcting effect of the segmentation model on the punctuation model's results. Conclusion: the scheme can additionally improve the performance of the punctuation model, further improving the user's reading experience.
On the other hand, complete decoupling was considered as an alternative:
slicing the text to obtain a plurality of text slices;
determining, based on the paragraph segmentation model, whether each text slice needs to be segmented;
and if segmentation is needed, inputting the text slice into the upstream punctuation model and determining the position of the segment boundary based on the upstream punctuation model's output.
In this embodiment, the principle of complete decoupling is as follows: the input of the segmentation model is made consistent with the input of the punctuation model. The text to be segmented is sliced according to a fixed window size, and the model then predicts for each slice whether it needs to be segmented. By combining the output of the punctuation model within each window, the exact position of the segment boundary can be determined.
Further, if the window size is chosen appropriately, the punctuation model's output will contain at most one punctuation mark near the end of each slice, so the exact segment position can be determined more accurately.
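Assuming the segmentation model and the punctuation model can both be treated as callables (mocked below; the window size, names, and mocks are all illustrative), the fully decoupled scheme might look like:

```python
def segment_decoupled(text, window, needs_break, punct_offsets):
    """Return absolute character offsets at which to break paragraphs.

    needs_break   : callable standing in for the segmentation model
    punct_offsets : callable standing in for the upstream punctuation
                    model; returns offsets of sentence enders in a slice
    """
    breaks = []
    for start in range(0, len(text), window):
        chunk = text[start:start + window]
        if needs_break(chunk):
            offsets = punct_offsets(chunk)  # consulted only when needed
            if offsets:
                # break right after the last sentence ender in the window
                breaks.append(start + offsets[-1] + 1)
    return breaks

# Mocks: break whenever the window contains a period; the punctuation
# model reports the offsets of '.' characters inside the window.
needs = lambda c: "." in c
puncts = lambda c: [i for i, ch in enumerate(c) if ch == "."]

positions = segment_decoupled(
    "abc. defg. hij", window=7, needs_break=needs, punct_offsets=puncts
)
```

Note the design property the text claims: the break/no-break decision never depends on the punctuation output, which is only consulted afterwards to place the boundary.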
As this embodiment shows, this segmentation approach is completely decoupled from the upstream punctuation model: the segmentation result is determined entirely by the segmentation model, and the upstream punctuation model only provides the exact position of a segment when needed. The segmentation result is therefore unaffected by the upstream punctuation, and the robustness of the model is significantly improved.
Fig. 5 is a schematic structural diagram of a training system for a paragraph segmentation model according to an embodiment of the present invention, which can execute the method for training a paragraph segmentation model according to any of the above embodiments and is configured in a terminal.
The present embodiment provides a training system 10 for a paragraph segmentation model, which includes: a model pre-training program module 11 and a segmentation model training program module 12.
The model pre-training program module 11 is configured to pre-train a neural network model of the paragraph segmentation model using general segmentation data; the segmentation model training program module 12 is configured to train, on domain segmentation data, the coding layer related to feature extraction in the pre-trained paragraph segmentation model, to obtain a domain-adapted paragraph segmentation model.
Further, the coding layer related to feature extraction in the neural network model of the paragraph segmentation model is shared with the coding layer related to feature extraction of the domain-adapted paragraph segmentation model, and is used for learning and extracting lexical, syntactic, and grammatical features.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the training method of the paragraph segmentation model in any method embodiment;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
pre-training a neural network model of the paragraph segmentation model by using general segmentation data;
and training the coding layer related to feature extraction in the pre-trained paragraph segmentation model on domain segmentation data to obtain a domain-adapted paragraph segmentation model.
The non-volatile computer-readable storage medium may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the methods in embodiments of the present invention. One or more program instructions are stored in the non-transitory computer-readable storage medium and, when executed by a processor, perform the method of training a paragraph segmentation model in any of the method embodiments described above.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for training a paragraph segmentation model according to any of the embodiments of the present invention.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones, multimedia phones, functional phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices such devices may display and play multimedia content. The devices comprise audio and video players, handheld game consoles, electronic books, intelligent toys and portable vehicle-mounted navigation devices.
(4) Other electronic devices with data processing capabilities.
As used herein, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for training a paragraph segmentation model, comprising:
pre-training a neural network model of the paragraph segmentation model by using general segmentation data;
and training the coding layer related to feature extraction in the pre-trained paragraph segmentation model on domain segmentation data to obtain a domain-adapted paragraph segmentation model.
2. The method according to claim 1, wherein the coding layer related to feature extraction in the neural network model of the paragraph segmentation model and the coding layer related to feature extraction of the field-adapted paragraph segmentation model are shared for learning and extracting lexical, syntactic and grammatical features.
3. The method of claim 1, wherein the neural network model comprises a BERT model.
4. The method of claim 1, wherein the domain segmentation data is generated from an upstream punctuation model and manually labeled segmentation data, comprising:
inputting raw domain data into the upstream punctuation model to obtain punctuated domain data;
receiving manually labeled segmentation data in which the raw domain data has been annotated by hand;
determining a sentence-ending symbol set based on the punctuation types in the manually labeled segmentation data, and segmenting the raw domain data to obtain manually punctuated and segmented domain data;
and generating domain segmentation data with punctuation information and segmentation information based on the punctuated domain data and the manually punctuated domain data.
5. The method of claim 4, wherein prior to said inputting raw domain data into an upstream punctuation model, the method further comprises: and performing punctuation removal processing on the original field data.
6. The method of any of claims 1-5, wherein the field segmentation data has a smaller data volume than the general segmentation data.
7. A system for training a paragraph segmentation model, comprising:
the model pre-training program module is used for pre-training the neural network model of the paragraph segmentation model by utilizing the general segmentation data;
and the segmentation model training program module is used for training a coding layer related to feature extraction in the pre-trained paragraph segmentation model based on field segmentation data, to obtain a paragraph segmentation model adapted to the field.
8. The system of claim 7, wherein the coding layer related to feature extraction in the neural network model of the paragraph segmentation model and the coding layer related to feature extraction in the field-adapted paragraph segmentation model are shared, and are used to learn and extract lexical, syntactic and grammatical features.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1-6.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
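The two-stage procedure of claims 1-2 — pre-train a shared encoder on general segmentation data, then continue training the same feature-extraction (coding) layer on field segmentation data — can be sketched as follows. The `Encoder` and `SegmentationModel` classes below are toy stand-ins (the patent uses a BERT encoder, per claim 3), and the corpus names are illustrative, not taken from the patent.

```python
class Encoder:
    """Toy stand-in for the shared feature-extraction (coding) layer,
    e.g. a BERT encoder in the patent."""
    def __init__(self):
        self.training_log = []  # records which corpus each update used

    def update(self, corpus):
        self.training_log.append(corpus)


class SegmentationModel:
    """Paragraph segmentation model built on a shared encoder."""
    def __init__(self, encoder):
        self.encoder = encoder  # the same coding layer in both stages

    def train(self, corpus, steps):
        for _ in range(steps):
            self.encoder.update(corpus)


# Stage 1 (claim 1): pre-train on the large general segmentation data.
encoder = Encoder()
model = SegmentationModel(encoder)
model.train("general_segmentation_data", steps=3)

# Stage 2 (claims 1-2): keep the same encoder and continue training on the
# smaller field segmentation data (claim 6) to adapt it to the field.
model.train("field_segmentation_data", steps=1)

print(encoder.training_log)
```

Because the encoder object is shared across both stages, the field-adaptation stage updates the very parameters learned during pre-training, which is the sharing relationship that claims 2 and 8 describe.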
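Claims 4-5 describe how the field segmentation data is generated: de-punctuate the raw field data, re-punctuate it with the upstream punctuation model, segment on a sentence-ending symbol set derived from the manual annotations, and merge the results into data carrying both punctuation and paragraph information. A minimal sketch, in which `stub_punctuation_model`, the boundary offsets, and the sample sentence are all hypothetical placeholders (a real upstream model predicts the punctuation positions):

```python
import re

# Sentence-ending symbol set, determined from the punctuation types that
# appear in the manual annotation data (claim 4).
SENTENCE_END = {"。", "？", "！"}

def strip_punctuation(text):
    # Claim 5: remove punctuation from the original field data before it
    # is fed to the upstream punctuation model.
    return re.sub(r"[。，？！]", "", text)

def stub_punctuation_model(text, boundaries):
    # Hypothetical stand-in for the upstream punctuation model: it simply
    # inserts "。" at the given character offsets.
    out, prev = [], 0
    for b in boundaries:
        out.append(text[prev:b] + "。")
        prev = b
    out.append(text[prev:])
    return "".join(out)

def split_sentences(punctuated):
    # Segment on the sentence-ending symbol set (claim 4).
    parts = re.split("([" + "".join(SENTENCE_END) + "])", punctuated)
    sents = ["".join(p) for p in zip(parts[0::2], parts[1::2] + [""])]
    return [s for s in sents if s]

raw = "今天天气很好。我们去公园。"   # illustrative original field data
depunct = strip_punctuation(raw)
punctuated = stub_punctuation_model(depunct, boundaries=[6])
sentences = split_sentences(punctuated)

# Merge punctuation info with a (here hard-coded) manual paragraph label to
# obtain field segmentation data with both kinds of information.
field_segmentation_data = [
    {"text": s, "paragraph_end": i == len(sentences) - 1}
    for i, s in enumerate(sentences)
]
print(field_segmentation_data)
```

In practice the paragraph labels come from the human annotators rather than being hard-coded, and the punctuation boundaries come from the trained upstream model.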
CN202011583136.1A 2020-12-28 2020-12-28 Training method and system of paragraph segmentation model Active CN112749544B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011583136.1A CN112749544B (en) 2020-12-28 2020-12-28 Training method and system of paragraph segmentation model

Publications (2)

Publication Number Publication Date
CN112749544A true CN112749544A (en) 2021-05-04
CN112749544B CN112749544B (en) 2024-04-30

Family

ID=75646287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011583136.1A Active CN112749544B (en) 2020-12-28 2020-12-28 Training method and system of paragraph segmentation model

Country Status (1)

Country Link
CN (1) CN112749544B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222654A (en) * 2019-06-10 2019-09-10 北京百度网讯科技有限公司 Text segmenting method, device, equipment and storage medium
CN110427482A (en) * 2019-07-31 2019-11-08 腾讯科技(深圳)有限公司 A kind of abstracting method and relevant device of object content
CN111553147A (en) * 2020-03-27 2020-08-18 南京工业大学 BERT model based on N-gram and semantic segmentation method
CN111930937A (en) * 2020-06-28 2020-11-13 山东师范大学 BERT-based intelligent government affair text multi-classification method and system
CN111931482A (en) * 2020-09-22 2020-11-13 苏州思必驰信息科技有限公司 Text segmentation method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113641793A (en) * 2021-08-16 2021-11-12 国网安徽省电力有限公司电力科学研究院 Retrieval system for long text matching optimization aiming at power standard
CN113641793B (en) * 2021-08-16 2024-05-07 国网安徽省电力有限公司电力科学研究院 Retrieval system for long text matching optimization aiming at electric power standard
CN113673255A (en) * 2021-08-25 2021-11-19 北京市律典通科技有限公司 Text function region splitting method and device, computer equipment and storage medium
CN113673255B (en) * 2021-08-25 2023-06-30 北京市律典通科技有限公司 Text function area splitting method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN112749544B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
CN110852087B (en) Chinese error correction method and device, storage medium and electronic device
CN110110041B (en) Wrong word correcting method, wrong word correcting device, computer device and storage medium
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
CN110674629A (en) Punctuation mark model and its training method, equipment and storage medium
CN109949799B (en) Semantic parsing method and system
CN111737961B (en) Method and device for generating story, computer equipment and medium
CN111723207B (en) Intention identification method and system
CN111680129B (en) Training method and system of semantic understanding system
CN110765270A (en) Training method and system of text classification model for spoken language interaction
CN112686051A (en) Semantic recognition model training method, recognition method, electronic device, and storage medium
CN116151220A (en) Word segmentation model training method, word segmentation processing method and device
CN111160026A (en) Model training method and device, and method and device for realizing text processing
CN113204956B (en) Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
CN112749544B (en) Training method and system of paragraph segmentation model
CN111199151A (en) Data processing method and data processing device
CN111079433A (en) Event extraction method and device and electronic equipment
CN111128122B (en) Method and system for optimizing rhythm prediction model
CN111462734B (en) Semantic slot filling model training method and system
CN111090970B (en) Text standardization processing method after voice recognition
CN110969005A (en) Method and device for determining similarity between entity corpora
CN111046674A (en) Semantic understanding method and device, electronic equipment and storage medium
CN114297372A (en) Personalized note generation method and system
CN115796141A (en) Text data enhancement method and device, electronic equipment and storage medium
CN115129843A (en) Dialog text abstract extraction method and device
CN111090720B (en) Hot word adding method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

GR01 Patent grant