CN114707487A - Text processing method, text processing device, storage medium and electronic device - Google Patents

Text processing method, text processing device, storage medium and electronic device

Info

Publication number
CN114707487A
CN114707487A (application CN202210243644.8A)
Authority
CN
China
Prior art keywords
chunk
text
vector
word vector
chunks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210243644.8A
Other languages
Chinese (zh)
Inventor
刘畅
王亦宁
刘升平
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202210243644.8A priority Critical patent/CN114707487A/en
Publication of CN114707487A publication Critical patent/CN114707487A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text processing method, a text processing device, a storage medium and an electronic device. The text processing method comprises the following steps: first, a long text is split into clauses at basic punctuation marks, and each resulting clause is modeled separately with a pre-training model; then, over the model representations obtained in the previous step, a bidirectional recurrent neural network establishes context dependencies between the clauses, so that chunking analysis is completed with the clause as the unit. The embodiment of the invention breaks through the length limitation of the pre-training model, combines it with the strength of recurrent neural network models at capturing contextual, temporal dependencies between clauses, and can better solve the problems of long-text modeling and segmentation, thereby at least solving the technical problem of low text segmentation accuracy in the prior art.

Description

Text processing method and device, storage medium and electronic device
Technical Field
The present invention relates to the field of text processing, and in particular, to a text processing method, a text processing apparatus, a storage medium, and an electronic apparatus.
Background
Chunking analysis refers to the process of segmenting a text into different segments according to its content. Medical texts (such as medical record texts) are generally long, and the current mainstream approach to chunking analysis models the text with a recurrent neural network and then performs sequence labeling with a conditional random field connected at the top of the network, thereby completing the chunking analysis.
The current sequence-labeling-based methods perform poorly. With the continuous development of pre-training models, represented by BERT, methods based on pre-training models have become the mainstream in the NLP field; however, a pre-training model imposes a strict limit on the length of text it can process and is difficult to apply directly to long texts, while pre-training models designed specifically for long-text modeling (e.g., XLNet) are mostly computationally expensive and difficult to deploy in a production environment.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a text processing method and device, a storage medium and an electronic device, so as to at least solve the technical problem of low text segmentation accuracy in the prior art.
According to an aspect of an embodiment of the present invention, there is provided a text processing method, including: segmenting a target text according to a preset separator to obtain a chunk set corresponding to the target text, wherein chunks in the chunk set are composed of different words; inputting each chunk in the chunk set into a pre-training model to obtain a word vector corresponding to each chunk; inputting the word vector corresponding to each chunk into a bidirectional recurrent neural network, and determining the dependency relationship between the chunks; and processing the chunks in the chunk set according to the dependency relationship to obtain the target text segmentation result.
Optionally, the processing the chunks in the chunk set according to the dependency relationship to obtain the target text segmentation result includes: combining chunks with the same dependency relationship to obtain the target text segmentation result.
Optionally, the inputting each chunk in the chunk set into a pre-training model to obtain a word vector corresponding to each chunk includes: in the case that the target text D is segmented into N clauses and the chunk set comprises the N clauses, performing word vector encoding through the pre-training model to obtain an n × m word vector for each clause, where n denotes the number of tokens in the clause and m denotes the feature vector dimension of the pre-training model.
Optionally, after the word vector encoding is performed through the pre-training model to obtain the n × m word vector of each clause, the method further includes: converting the word vector of each clause into a one-dimensional vector by average pooling, to obtain a text sequence corresponding to the target text.
Optionally, the inputting the word vector corresponding to each chunk into the bidirectional recurrent neural network and determining the dependency relationship between the chunks includes: inputting the text sequence into the bidirectional recurrent neural network to obtain, for each clause, a representation vector containing context semantics; processing the representation vector through a fully connected layer to obtain a target vector; and determining the dependency relationship between the chunks according to the target vector.
According to another aspect of the embodiments of the present invention, there is also provided a text processing apparatus including: the segmentation unit is used for segmenting a target text according to a preset separator to obtain a chunk set corresponding to the target text, wherein chunks in the chunk set are composed of different words; the first obtaining unit is used for inputting each chunk in the chunk set into a pre-training model to obtain a word vector corresponding to each chunk; the determining unit is used for inputting the word vector corresponding to each chunk into the bidirectional circulation neural network and determining the dependency relationship among the chunks; and the second obtaining unit is used for processing the chunks in the chunk set according to the dependency relationship to obtain the target text segmentation result.
Optionally, the second obtaining unit includes: and the second obtaining module is used for combining the chunks with the same dependency relationship to obtain the target text segmentation result.
Optionally, the first obtaining unit includes: a first obtaining module, configured to, in the case that the target text D is segmented into N clauses and the chunk set comprises the N clauses, perform word vector encoding through the pre-training model to obtain an n × m word vector for each clause, where n denotes the number of tokens in the clause and m denotes the feature vector dimension of the pre-training model.
Optionally, the apparatus further comprises: a calculation unit, configured to, after the word vector encoding is performed through the pre-training model to obtain the n × m word vector of each clause, convert the word vector of each clause into a one-dimensional vector by average pooling, to obtain the text sequence corresponding to the target text.
Optionally, the determining unit includes: a third obtaining module, configured to input the text sequence into the bidirectional recurrent neural network to obtain, for each clause, a representation vector containing context semantics; a processing module, configured to process the representation vector through a fully connected layer to obtain a target vector; and a determining module, configured to determine the dependency relationship between the chunks according to the target vector.
According to yet another aspect of embodiments of the present application, a computer-readable storage medium is provided, wherein a computer program is stored in the storage medium, and the computer program is configured to execute the text processing method when run.
According to still another aspect of embodiments of the present application, there is provided an electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to perform the text processing method.
In the embodiment of the invention, a long text is first split into clauses at basic punctuation marks, and each resulting clause is modeled separately with a pre-training model; then, over the model representations obtained in the previous step, a bidirectional recurrent neural network establishes context dependencies between the clauses, so that chunking analysis is completed with the clause as the unit. The embodiment of the invention breaks through the length limitation of the pre-training model, combines it with the strength of recurrent neural network models at capturing contextual, temporal dependencies between clauses, and can better solve the problems of long-text modeling and segmentation, thereby at least solving the technical problem of low text segmentation accuracy in the prior art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
fig. 1 is a block diagram of a hardware structure of a mobile terminal according to an alternative text processing method of an embodiment of the present invention;
FIG. 2 is a flow diagram of an alternative text processing method according to an embodiment of the invention;
FIG. 3 is a schematic diagram of an alternative model structure according to an embodiment of the invention;
fig. 4 is a diagram of an alternative text processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or otherwise described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a sequence of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The text processing method provided by the embodiment of the application can be executed on a mobile terminal, a computer terminal or a similar computing device. Taking a mobile terminal as an example, fig. 1 is a block diagram of the hardware structure of a mobile terminal for a text processing method according to an embodiment of the present invention. As shown in fig. 1, the mobile terminal 10 may include one or more processors 102 (only one is shown in fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microcontroller (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data, and optionally a transmission device 106 for communication functions and an input/output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and does not limit the structure of the mobile terminal. For example, the mobile terminal 10 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
The memory 104 can be used for storing computer programs, for example, software programs and modules of application software, such as a computer program corresponding to the text processing method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer programs stored in the memory 104, so as to implement the above-mentioned method. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
In this embodiment, a text processing method is further provided, and fig. 2 is a flowchart of the text processing method according to the embodiment of the present invention, as shown in fig. 2, the flow of the text processing method includes the following steps:
step S202, segmenting the target text according to preset separators to obtain a chunk set corresponding to the target text, wherein chunks in the chunk set are composed of different words.
Step S204, each chunk in the chunk set is input into a pre-training model to obtain a word vector corresponding to each chunk.
Step S206, inputting the word vector corresponding to each chunk into the bidirectional recurrent neural network, and determining the dependency relationship between the chunks.
Step S208, processing the chunks in the chunk set according to the dependency relationship to obtain a target text segmentation result.
In this embodiment, the preset separators include, but are not limited to, basic punctuation marks, at which the original text is cut. Based on analysis of the writing style of medical texts, the separators adopted are the comma (，), semicolon (；) and period (。) among the Chinese punctuation marks. A long text D is thus divided into N clauses: D = [span_1, …, span_N].
In the present embodiment, the target text may include, but is not limited to, a medical long text.
According to the embodiment provided by the application, the target text is segmented according to the preset separators to obtain a chunk set corresponding to the target text, wherein the chunks in the chunk set are composed of different words; each chunk in the chunk set is input into a pre-training model to obtain a word vector corresponding to each chunk; the word vector corresponding to each chunk is input into a bidirectional recurrent neural network to determine the dependency relationship between the chunks; and the chunks in the chunk set are processed according to the dependency relationship to obtain a target text segmentation result. The method thereby at least solves the technical problem of low text segmentation accuracy in the prior art.
That is, in this embodiment, the long text is split into clauses at basic punctuation marks, and the resulting clauses are each modeled with the pre-training model; then, over the model representations obtained in the previous step, a bidirectional recurrent neural network establishes context dependencies between the clauses, so that chunking analysis is completed with the clause as the unit. This breaks through the length limitation of the pre-training model, combines it with the strength of recurrent neural network models at capturing contextual, temporal dependencies between clauses, and can better solve the problems of long-text modeling and segmentation, thereby at least solving the technical problem of low text segmentation accuracy in the prior art.
Optionally, processing the chunks in the chunk set according to the dependency relationship to obtain a target text segmentation result may include: combining chunks with the same dependency relationship to obtain the target text segmentation result.
Optionally, inputting each chunk in the chunk set into the pre-training model to obtain the word vector corresponding to each chunk may include: in the case that the target text D is segmented into N clauses and the chunk set comprises the N clauses, performing word vector encoding through the pre-training model to obtain an n × m word vector for each clause, where n denotes the number of tokens in the clause and m denotes the feature vector dimension of the pre-training model.
Optionally, after performing word vector encoding through the pre-training model to obtain the n × m word vector of each clause, the method may further include: converting the word vector of each clause into a one-dimensional vector by average pooling, to obtain a text sequence corresponding to the target text.
Optionally, inputting the word vector corresponding to each chunk into the bidirectional recurrent neural network and determining the dependency relationship between the chunks may include: inputting the text sequence into the bidirectional recurrent neural network to obtain, for each clause, a representation vector containing context semantics; processing the representation vector through a fully connected layer to obtain a target vector; and determining the dependency relationship between the chunks according to the target vector.
As an alternative embodiment, the present application further provides a method for segmenting a chunk of a medical long text based on a pre-training model.
In this embodiment, a long text is first split into clauses at basic punctuation marks, and the resulting clauses are modeled separately with a pre-training model; then, over the model representations obtained in the previous step, a bidirectional recurrent neural network establishes context dependencies between the clauses, so that chunking analysis is completed with the clause as the unit. The method breaks through the length limitation of the pre-training model, combines it with the strength of recurrent neural network models at capturing contextual, temporal dependencies between clauses, and can well solve the problems of long-text modeling and segmentation. Fig. 3 shows a schematic diagram of the model structure.
Step 1: and segmenting the long text (equivalent to the target text), and obtaining the segmented clause word vector code by using a pre-training model.
Step 1.1: and taking the basic punctuation marks as separators to segment the original text. Based on the analysis of the medical text line character, the separator symbols adopted by the Chinese are comma (,), semicolon (;) and period (#) in the Chinese punctuation mark. For a long text D, the long text D is divided into N sentences:
D = [span_1, …, span_N]
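As an illustration of step 1.1, the split can be sketched in a few lines of Python. This is a minimal sketch under the assumptions stated above (Chinese comma, semicolon and full stop as separators); it is not code from the patent, and the sample sentence is invented:

```python
import re

# Separators named in step 1.1: Chinese comma, semicolon and full stop.
SEPARATORS = "，；。"

def split_into_clauses(text: str) -> list[str]:
    """Split a long text D into clauses [span_1, ..., span_N]."""
    # A zero-width split after each separator keeps the punctuation
    # attached to its clause; empty tail pieces are dropped.
    parts = re.split(f"(?<=[{SEPARATORS}])", text)
    return [p for p in parts if p.strip()]

clauses = split_into_clauses("患者男，45岁；主诉头痛三天。既往体健。")
# -> ['患者男，', '45岁；', '主诉头痛三天。', '既往体健。']
```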
step 1.2: a word vector representation for each clause is obtained using a pre-trained language model. For each clause span containing n tokens [ token1,…,tokenn]And carrying out word vector coding through a pre-training language model to obtain the expression of the clause:
Espan=Bertemb(span)=[vt1,…,vtn]
where the shape of E_span is n × hiddenSize, hiddenSize being the feature vector dimension of the pre-trained model, typically hiddenSize = 768.
Step 2: long text is modeled using a pre-trained model and a recurrent neural network.
Step 2.1: each clause span is encoded using a pre-trained model:
hspan=Bertlayer(Espan)=[ht1,…,htn]
where the shape of h_span is usually the same as that of E_span.
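Steps 1.2 and 2.1 together amount to running each clause through a pre-trained encoder. A minimal sketch using the Hugging Face transformers package follows; the package and the bert-base-chinese checkpoint are implementation assumptions, since the patent names neither:

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumption: bert-base-chinese as the pre-training model (hiddenSize = 768).
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def encode_clause(span: str) -> torch.Tensor:
    """Return h_span of shape (n, hiddenSize) for one clause.

    BertModel internally applies the embedding lookup (Bert_emb, step 1.2)
    and the encoder layers (Bert_layer, step 2.1); last_hidden_state is h_span.
    """
    inputs = tokenizer(span, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():  # fine-tuning would instead keep gradients enabled
        outputs = bert(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # (n, 768)

h_spans = [encode_clause(s) for s in ["患者男，", "45岁；"]]
```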
Step 2.2: using a method of computational average pooling, the representation h of each clausespanInto a one-dimensional vector s of shape 1 × hiddenSize, so that the entire text D can be represented as a sequence of representations of one-dimensional vectors for each clause:
s=Pooling(hspan)
Spooling=[s1,…,sn]
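Step 2.2 is a plain mean over the token dimension. A small sketch, where the clause encodings are random stand-ins for the step 2.1 output:

```python
import torch

def mean_pool(h_span: torch.Tensor) -> torch.Tensor:
    """Step 2.2: average pooling, (n, hiddenSize) -> (hiddenSize,)."""
    return h_span.mean(dim=0)

# Toy stand-in for step 2.1 output: three clauses of 5, 8 and 3 tokens.
h_spans = [torch.randn(n, 768) for n in (5, 8, 3)]
s_pooling = torch.stack([mean_pool(h) for h in h_spans])  # S_pooling: (N, 768)
```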
step 2.3: the results of the average pooling were concatenated using a recurrent neural network. Transmitting the clause expression sequence obtained in the step 2.2 into a bidirectional recurrent neural network model based on a gated recurrent unit (BiGRU), and associating the expression of each clause with the adjacent clauses to obtain an expression sc containing context semantics of each clause:
sc=BiGRU(Spooling)=[sc1,sc2,…,scn]
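Step 2.3 corresponds directly to a bidirectional GRU in, e.g., PyTorch. In the sketch below, the hidden size of 384 is an assumption chosen so that the two concatenated directions give 768 again; the patent does not fix these dimensions:

```python
import torch
import torch.nn as nn

# Bidirectional GRU over the clause sequence (step 2.3).
bigru = nn.GRU(input_size=768, hidden_size=384,
               bidirectional=True, batch_first=True)

s_pooling = torch.randn(1, 10, 768)  # (batch, N clauses, hiddenSize)
sc, _ = bigru(s_pooling)             # sc: (1, N, 2 * 384) = (1, N, 768)
```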
step 2.4: converting the result obtained in step 2.3 into sc by using a full-connection neural network, and outputting a final representation s of the layerout
sout=FFN(sc)
Step 2.5: repeating the steps of 2.1-2.4 for L times to obtain semantic representation of the L layer of the long text
Figure BDA0003544040660000081
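One plausible reading of steps 2.3 to 2.5 is a stack of L identical BiGRU-plus-FFN layers over the clause sequence. The sketch below implements that reading; the layer count and dimensions are assumptions, and whether the BERT encoding of step 2.1 is also re-run per layer is left open by the text:

```python
import torch
import torch.nn as nn

class ClauseLayer(nn.Module):
    """One clause-level layer: BiGRU (step 2.3) followed by FFN (step 2.4)."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.bigru = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)
        self.ffn = nn.Linear(dim, dim)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        sc, _ = self.bigru(s)  # context-aware clause representations
        return self.ffn(sc)    # s_out = FFN(sc)

class ClauseEncoder(nn.Module):
    """Step 2.5: repeat the layer L times to obtain s_out^(L)."""
    def __init__(self, num_layers: int = 2, dim: int = 768):
        super().__init__()
        self.layers = nn.ModuleList(ClauseLayer(dim) for _ in range(num_layers))

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            s = layer(s)
        return s

s_out_L = ClauseEncoder(num_layers=2)(torch.randn(1, 10, 768))  # (1, N, 768)
```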
Step 3: sequence labeling using a conditional random field at the top of the model network.
Step 3.1: for the clause representations s_out^(L) output by step 2, perform sequence labeling at the top layer with a conditional random field (CRF) to obtain the labeling result:
L_crf = CRF(s_out^(L))
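Step 3.1 can be realized with an off-the-shelf CRF layer such as the pytorch-crf package; both the package and the tag count below are assumptions, since the patent specifies neither:

```python
import torch
from torchcrf import CRF  # pip install pytorch-crf (assumed implementation)

num_tags = 4                             # assumption: one tag per chunk class
emission_proj = torch.nn.Linear(768, num_tags)
crf = CRF(num_tags, batch_first=True)

s_out_L = torch.randn(1, 10, 768)        # layer-L clause representations
emissions = emission_proj(s_out_L)       # (1, N, num_tags)
labels = crf.decode(emissions)           # e.g. [[0, 0, 1, 1, 1, 2, 2, 3, 3, 3]]
# During training, the loss would be -crf(emissions, gold_tags).
```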
step 3.2: and merging continuous clauses belonging to the same classification to finally obtain a content-based segmentation result of the long text D:
D=[Part1,Part2,…,Partm]
Because consecutive clauses are merged, the number m of resulting chunks is not necessarily equal to the number N of clauses.
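Step 3.2 reduces to grouping consecutive clauses that received the same label. A minimal sketch with invented toy data:

```python
from itertools import groupby

def merge_clauses(clauses: list[str], labels: list[int]) -> list[str]:
    """Step 3.2: merge consecutive same-label clauses into chunks Part_1..Part_m."""
    return ["".join(clause for clause, _ in group)
            for _, group in groupby(zip(clauses, labels), key=lambda p: p[1])]

print(merge_clauses(["a，", "b。", "c，", "d。", "e。"], [0, 0, 1, 1, 0]))
# -> ['a，b。', 'c，d。', 'e。']  (m = 3 chunks from N = 5 clauses)
```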
In this embodiment, first, from the long-text perspective, the proposed method is not limited by the length of the processed text (pre-trained models are typically limited to 512 tokens), a length often exceeded by medical texts. Second, compared with a BiLSTM+CRF sequence labeling method, the rich semantic information in the pre-training model yields higher-quality text representations, and the pre-training model can be fine-tuned during training, so its representation ability improves as it is trained on medical-domain text. Finally, the bidirectional recurrent neural network over the clauses ensures that the clause representations finally used for labeling contain context dependencies, so a segmentation result superior to previous methods can be obtained.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, or by hardware, though in many cases the former is the better implementation. Based on such understanding, the technical solution of the present invention, or the portions contributing over the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and including instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, a text processing apparatus is further provided, and the text processing apparatus is used for implementing the foregoing embodiments and preferred embodiments, and the description of the text processing apparatus is omitted for brevity. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 4 is a block diagram of a structure of a text processing apparatus according to an embodiment of the present invention, as shown in fig. 4, the text processing apparatus including:
the segmentation unit 41 is configured to segment the target text according to a preset delimiter to obtain a chunk set corresponding to the target text, where chunks in the chunk set are composed of different words.
The first obtaining unit 43 is configured to input each chunk in the chunk set into the pre-training model to obtain a word vector corresponding to each chunk.
The determining unit 45 is used for inputting the word vector corresponding to each chunk into the bidirectional recurrent neural network and determining the dependency relationship between the chunks.
The second obtaining unit 47 is configured to process the chunks in the chunk set according to the dependency relationship, so as to obtain a target text segmentation result.
According to the embodiment provided by the application, the segmentation unit 41 segments the target text according to the preset separators to obtain a chunk set corresponding to the target text, wherein chunks in the chunk set are composed of different words; the first obtaining unit 43 inputs each chunk in the chunk set into the pre-training model to obtain a word vector corresponding to each chunk; the determining unit 45 inputs the word vector corresponding to each chunk into the bidirectional recurrent neural network and determines the dependency relationship between the chunks; and the second obtaining unit 47 processes the chunks in the chunk set according to the dependency relationship to obtain a target text segmentation result. The apparatus thereby at least solves the technical problem of low text segmentation accuracy in the prior art.
Optionally, the second obtaining unit 47 may include: a second obtaining module, configured to merge chunks with the same dependency relationship to obtain a target text segmentation result.
Optionally, the first obtaining unit 43 may include: a first obtaining module, configured to, in the case that the target text D is segmented into N clauses and the chunk set comprises the N clauses, perform word vector encoding through the pre-training model to obtain an n × m word vector for each clause, where n denotes the number of tokens in the clause and m denotes the feature vector dimension of the pre-training model.
Optionally, the apparatus may further include: a calculation unit, configured to, after the word vector encoding is performed through the pre-training model to obtain the n × m word vector of each clause, convert the word vector of each clause into a one-dimensional vector by average pooling, to obtain a text sequence corresponding to the target text.
Optionally, the determining unit 45 may include: a third obtaining module, configured to input the text sequence into the bidirectional recurrent neural network to obtain, for each clause, a representation vector containing context semantics; a processing module, configured to process the representation vector through a fully connected layer to obtain a target vector; and a determining module, configured to determine the dependency relationship between the chunks according to the target vector.
It should be noted that the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the following steps:
S1, segmenting the target text according to preset separators to obtain a chunk set corresponding to the target text, wherein chunks in the chunk set are composed of different words;
S2, inputting each chunk in the chunk set into a pre-training model to obtain a word vector corresponding to each chunk;
S3, inputting the word vector corresponding to each chunk into a bidirectional recurrent neural network, and determining the dependency relationship between the chunks;
and S4, processing the chunks in the chunk set according to the dependency relationship to obtain a target text segmentation result.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention further provide an electronic device, comprising a memory having a computer program stored therein and a processor configured to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps through the computer program:
S1, segmenting the target text according to preset separators to obtain a chunk set corresponding to the target text, wherein chunks in the chunk set are composed of different words;
S2, inputting each chunk in the chunk set into a pre-training model to obtain a word vector corresponding to each chunk;
S3, inputting the word vector corresponding to each chunk into a bidirectional recurrent neural network, and determining the dependency relationship between the chunks;
and S4, processing the chunks in the chunk set according to the dependency relationship to obtain a target text segmentation result.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the invention described above may be implemented on a general-purpose computing device, centralized on a single computing device or distributed across a network of computing devices; optionally, they may be implemented in program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device, and in some cases the steps shown or described may be performed in an order different from that described herein; alternatively, they may be fabricated separately as individual integrated circuit modules, or multiple of them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method of text processing, comprising:
segmenting a target text according to a preset separator to obtain a chunk set corresponding to the target text, wherein chunks in the chunk set are composed of different words;
inputting each chunk in the chunk set into a pre-training model to obtain a word vector corresponding to each chunk;
inputting the word vector corresponding to each chunk into a bidirectional recurrent neural network, and determining the dependency relationship between the chunks;
and processing the chunks in the chunk set according to the dependency relationship to obtain the target text segmentation result.
2. The method according to claim 1, wherein the processing the chunks in the chunk set according to the dependency relationship to obtain the target text segmentation result includes:
and combining the chunks with the same dependency relationship to obtain the target text segmentation result.
3. The method of claim 1, wherein the inputting each chunk of the set of chunks into a pre-training model to obtain a word vector corresponding to each chunk comprises:
in the case that the target text D is segmented into N clauses and the chunk set comprises the N clauses, performing word vector encoding through the pre-training model to obtain an n × m word vector for each clause, wherein n denotes the number of tokens in the clause and m denotes the feature vector dimension of the pre-training model.
4. The method of claim 2, wherein after the word vector encoding is performed through the pre-training model to obtain the n × m word vector of each clause, the method further comprises:
converting the word vector of each clause into a one-dimensional vector by average pooling, to obtain a text sequence corresponding to the target text.
5. The method according to claim 1, wherein inputting the word vector corresponding to each chunk into a bidirectional recurrent neural network and determining the dependency relationship between the chunks comprises:
inputting the text sequence into the bidirectional recurrent neural network to obtain, for each clause, a representation vector containing context semantics;
processing the representation vector through a fully connected layer to obtain a target vector;
and determining the dependency relationship between the chunks according to the target vector.
6. A text processing apparatus, comprising:
the segmentation unit is used for segmenting a target text according to a preset separator to obtain a chunk set corresponding to the target text, wherein chunks in the chunk set are composed of different words;
the first obtaining unit is used for inputting each chunk in the chunk set into a pre-training model to obtain a word vector corresponding to each chunk;
the determining unit is used for inputting the word vector corresponding to each chunk into the bidirectional recurrent neural network and determining the dependency relationship among the chunks;
and the second obtaining unit is used for processing the chunks in the chunk set according to the dependency relationship to obtain the target text segmentation result.
7. The apparatus of claim 6, wherein the second obtaining unit comprises:
and the second obtaining module is used for combining the chunks with the same dependency relationship to obtain the target text segmentation result.
8. The apparatus of claim 6, wherein the first obtaining unit comprises:
a first obtaining module, configured to, in the case that the target text D is segmented into N clauses and the chunk set comprises the N clauses, perform word vector encoding through the pre-training model to obtain an n × m word vector for each clause, where n denotes the number of tokens in the clause and m denotes the feature vector dimension of the pre-training model.
9. The apparatus of claim 7, further comprising:
a calculation unit, configured to, after the word vector encoding is performed through the pre-training model to obtain the n × m word vector of each clause, convert the word vector of each clause into a one-dimensional vector by average pooling, to obtain the text sequence corresponding to the target text.
10. The apparatus of claim 6, wherein the determining unit comprises:
a third obtaining module, configured to input the text sequence into the bidirectional recurrent neural network to obtain, for each clause, a representation vector containing context semantics;
a processing module, configured to process the representation vector through a fully connected layer to obtain a target vector;
and a determining module, configured to determine the dependency relationship between the chunks according to the target vector.
11. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 5 when executed.
12. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 5.
CN202210243644.8A 2022-03-12 2022-03-12 Text processing method, text processing device, storage medium and electronic device Pending CN114707487A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210243644.8A CN114707487A (en) 2022-03-12 2022-03-12 Text processing method, text processing device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210243644.8A CN114707487A (en) 2022-03-12 2022-03-12 Text processing method, text processing device, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN114707487A true CN114707487A (en) 2022-07-05

Family

ID=82168896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210243644.8A Pending CN114707487A (en) 2022-03-12 2022-03-12 Text processing method, text processing device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN114707487A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115186649A (en) * 2022-08-30 2022-10-14 北京睿企信息科技有限公司 Relational model-based segmentation method and system for ultra-long text
CN115186649B (en) * 2022-08-30 2023-01-06 北京睿企信息科技有限公司 Relational model-based segmentation method and system for ultra-long text
CN117216208A (en) * 2023-09-01 2023-12-12 北京开普云信息科技有限公司 Question and answer method, device, storage medium and equipment based on long document

Similar Documents

Publication Publication Date Title
CN111222305B (en) Information structuring method and device
CN111090736B (en) Question-answering model training method, question-answering method, device and computer storage medium
CN110598203A (en) Military imagination document entity information extraction method and device combined with dictionary
CN114707487A (en) Text processing method, text processing device, storage medium and electronic device
CN112052687B (en) Semantic feature processing method, device and medium based on depth separable convolution
CN111309916B (en) Digest extracting method and apparatus, storage medium, and electronic apparatus
CN110619051A (en) Question and sentence classification method and device, electronic equipment and storage medium
CN112580328A (en) Event information extraction method and device, storage medium and electronic equipment
CN113486173B (en) Text labeling neural network model and labeling method thereof
CN112348111A (en) Multi-modal feature fusion method and device in video, electronic equipment and medium
CN116956929B (en) Multi-feature fusion named entity recognition method and device for bridge management text data
CN112948575B (en) Text data processing method, apparatus and computer readable storage medium
CN115600605A (en) Method, system, equipment and storage medium for jointly extracting Chinese entity relationship
CN110348012B (en) Method, device, storage medium and electronic device for determining target character
CN113434631B (en) Emotion analysis method and device based on event, computer equipment and storage medium
CN113076720B (en) Long text segmentation method and device, storage medium and electronic device
CN113935312A (en) Long text matching method and device, electronic equipment and computer readable storage medium
CN111368066A (en) Method, device and computer readable storage medium for acquiring dialogue abstract
CN115310429B (en) Data compression and high-performance calculation method in multi-round listening dialogue model
CN115796141A (en) Text data enhancement method and device, electronic equipment and storage medium
CN114722774A (en) Data compression method and device, electronic equipment and storage medium
CN113535946A (en) Text identification method, device and equipment based on deep learning and storage medium
CN113051869A (en) Method and system for identifying text difference content by combining semantic recognition
CN115526176A (en) Text recognition method and device, electronic equipment and storage medium
CN112597757A (en) Word detection method and device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination