WO2021213155A1 - Method, device, medium and electronic equipment for adding punctuation to text - Google Patents

Method, device, medium and electronic equipment for adding punctuation to text

Info

Publication number
WO2021213155A1
Authority
WO
WIPO (PCT)
Prior art keywords
words
relationship
word
vector
text
Prior art date
Application number
PCT/CN2021/084169
Other languages
English (en)
French (fr)
Inventor
颜泽龙
王健宗
吴天博
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021213155A1 (zh)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • This application belongs to the field of artificial intelligence technology, and in particular relates to a method, device, medium and electronic equipment for adding punctuation to text.
  • the present application provides a method, device, medium and electronic equipment for adding punctuation to text, which can improve the accuracy of adding punctuation to a certain extent.
  • a method for adding punctuation to text includes: obtaining a text to be added (that is, a text to which punctuation is to be added) and segmenting it to obtain a plurality of words; obtaining the relationships among the words, thereby obtaining the dependent word of each word and the relationship between each word and its dependent word; determining a relationship vector for each word based on the word, its dependent word, and the relationship between the word and its dependent word; obtaining the relationships among the relationship vectors of the plurality of words; and adding punctuation between the plurality of words based on the relationships among the relationship vectors.
  • a device for adding punctuation to text includes: an obtaining module configured to obtain a text to be added, segment the text to be added to obtain a plurality of words, and obtain the relationships among the words, thereby obtaining the dependent word of each word and the relationship between each word and its dependent word; a determining module configured to determine the relationship vector of each word based on the word, its dependent word, and the relationship between the word and its dependent word; and an adding module configured to obtain the relationships among the relationship vectors of the plurality of words and to add punctuation between the plurality of words based on those relationships.
  • a computer-readable program medium stores computer program instructions; when the computer program instructions are executed by a computer, the computer performs the method described above.
  • an electronic device includes: a processor; and a memory storing computer-readable instructions, where the processor performs the following steps when executing the computer-readable instructions: obtaining a text to be added and segmenting it to obtain multiple words; obtaining the relationships among the words, thereby obtaining the dependent word of each word and the relationship between each word and its dependent word; determining the relationship vector of each word based on the word, its dependent word, and the relationship between the word and its dependent word; obtaining the relationships among the relationship vectors of the multiple words; and adding punctuation between the multiple words based on the relationships among the relationship vectors.
  • the text to be added is segmented to obtain multiple words; the relationships among the words are obtained, giving the dependent word of each word and the relationship between each word and its dependent word; the relationship vector of each word is determined from the word, its dependent word, and the relationship between them; the relationships among the relationship vectors of the multiple words are obtained; and punctuation is added between the words based on those relationships. This takes into account both the relationships between words in the text to be added and the relationship between the words and the text, which can improve the accuracy of punctuation addition to a certain extent.
  • FIG. 1 shows a schematic diagram of an exemplary system architecture to which the technical solutions of the embodiments of the present application can be applied;
  • FIG. 2 schematically shows a flowchart of a method for adding punctuation to text according to an embodiment of the present application;
  • FIG. 3 schematically shows a structural diagram of a system for adding punctuation to text according to an embodiment of the present application;
  • FIG. 4 schematically shows a block diagram of a device for adding punctuation to text according to an embodiment of the present application;
  • FIG. 5 is a hardware diagram of an electronic device according to an exemplary embodiment;
  • FIG. 6 shows a computer-readable storage medium for implementing a method according to an exemplary embodiment.
  • FIG. 1 shows a schematic diagram of an exemplary system architecture 100 to which the technical solutions of the embodiments of the present application can be applied.
  • the system architecture 100 may include a terminal device 101 (which may be one or more of a smart phone, a tablet computer, a portable computer, or a desktop computer, among others), a network 102 and a server 103.
  • the network 102 is used to provide a medium of a communication link between the terminal device 101 and the server 103.
  • the network 102 may include various connection types, such as wired communication links, wireless communication links, and so on.
  • the numbers of the terminal device 101, the network 102, and the server 103 in FIG. 1 are merely illustrative. According to implementation needs, there may be any number of terminal devices 101, networks 102, and servers 103.
  • the server 103 may be a server cluster composed of multiple servers.
  • the server 103 obtains the text to be added, segments it to obtain multiple words, obtains the relationships among the words to obtain the dependent word of each word and the relationship between each word and its dependent word, determines the relationship vector of each word based on the word, its dependent word, and the relationship between them, obtains the relationships among the relationship vectors of the multiple words, and adds punctuation between the words based on those relationships. This takes into account both the relationships between words in the text to be added and the relationship between the words and the text, which can improve the accuracy of punctuation addition to a certain extent.
  • the method for adding punctuation to text provided in the embodiments of the present application is generally executed by the server 103, and accordingly, the device for adding punctuation to text is generally deployed in the server 103.
  • the terminal device 101 may also have similar functions as the server 103, so as to execute the method for adding punctuation to the text provided by the embodiments of the present application.
  • FIG. 2 schematically shows a flowchart of a method for adding punctuation to text according to an embodiment of the present application.
  • the execution subject of the method for adding punctuation to the text may be a server, for example, the server 103 shown in FIG. 1.
  • the method for adding punctuation to text includes at least step S210 to step S250, which are described in detail as follows:
  • in step S210, the text to be added is obtained, and the text to be added is segmented to obtain multiple words.
  • the text to be added may be segmented in text order to obtain a first word segmentation result, and segmented in reverse text order to obtain a second word segmentation result; the difference between the first and second word segmentation results is then obtained, and the part of the text corresponding to the difference is segmented from the middle outward to both sides to obtain a difference result; finally, the differing part of the first word segmentation result is replaced by the difference result, and the first word segmentation result after replacement is taken as the multiple words.
  • meaningless characters in the text to be added can be filtered out before the text is segmented.
  • each character in the text to be added can be recognized, each character can be combined with nearby characters and the combination compared against a preset word list, and the text can then be segmented according to the words in the preset word list.
  • the meaning of each character can be obtained; if the meanings of adjacent characters can be combined, the character and its adjacent characters are treated as one word.
  • the text to be added may be input into a pre-trained word segmentation model to obtain multiple words output by the word segmentation model.
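  • as a rough illustration of the forward and reverse segmentation option described above, the following sketch uses dictionary-based maximum matching. This is an assumption: the application does not name a segmentation algorithm. The vocabulary, the `max_len` limit, and all function names are hypothetical, and the middle-outward re-segmentation of the differing span is left out because its details are not given.

```python
def forward_max_match(text, vocab, max_len=4):
    """Segment left to right, always taking the longest vocabulary match."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # single characters always match
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len=4):
    """Segment right to left, always taking the longest vocabulary match."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def boundaries(words):
    """Character offsets at which a segmentation cuts the text."""
    cuts, pos = [0], 0
    for word in words:
        pos += len(word)
        cuts.append(pos)
    return cuts


def differing_spans(first, second):
    """Character ranges (start, end) where the two segmentations disagree."""
    a, b = boundaries(first), boundaries(second)
    common = sorted(set(a) & set(b))
    spans = []
    for start, end in zip(common, common[1:]):
        inner_a = [c for c in a if start < c < end]
        inner_b = [c for c in b if start < c < end]
        if inner_a != inner_b:
            spans.append((start, end))
    return spans


vocab = {"研究", "研究生", "生命", "起源"}  # hypothetical word list
text = "研究生命起源"
first = forward_max_match(text, vocab)
second = backward_max_match(text, vocab)
print(first, second, differing_spans(first, second))
```

  • only the spans returned by `differing_spans` would need a third segmentation pass; everything outside them is taken unchanged from the first result.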
  • in step S220, the relationships among the plurality of words are obtained, giving the dependent word of each word and the relationship between each word and its dependent word.
  • the part of speech and position of each word can be obtained; according to the part of speech and position of each word, the relationship between each word is determined.
  • the word meaning of each word can be obtained, and the relationship between multiple words can be determined according to the word meaning.
  • for each word, a word relation table can be searched using that word paired with each of the other words in the plurality of words, to obtain the word's dependent word and the relationship between the word and its dependent word.
  • multiple words may be input to the pre-trained relationship acquisition model to obtain the relationship between the multiple words output by the relationship acquisition model.
  • the relationship acquisition model may be a syntactic dependency tree model.
  • the dependency relationships may include a principal-subordinate relationship, a verb-object relationship, a passive relationship, a subordination relationship, a fixed collocation, an appositive relationship, an adjectival relationship, and the like.
  • a label may be set for each dependency relationship, so that a vector can be generated from the label in the subsequent steps.
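  • the output of step S220 can be pictured as one (word, dependent word, relation label) triple per word. The sketch below assumes the parse is already available as a list of head indices and relation names, with -1 marking the root; this input format, the `RELATION_IDS` table, the relation names, and the function name are all hypothetical and not taken from the application.

```python
# Hypothetical label table: each dependency relation gets an integer label
# so that it can later be mapped to a vector.
RELATION_IDS = {"root": 0, "subject": 1, "object": 2, "attribute": 3, "particle": 4}


def build_triples(words, heads, relations):
    """Return one (word, dependent_word, relation_id) triple per word.

    heads[i] is the index of the word that words[i] depends on, or -1 for
    the root, whose dependent word is taken to be the word itself.
    """
    triples = []
    for word, head, relation in zip(words, heads, relations):
        dependent = word if head == -1 else words[head]
        triples.append((word, dependent, RELATION_IDS[relation]))
    return triples


# Word-level rendering of the application's example sentence; the heads and
# relation names are assumed for illustration.
words = ["Xiao Zhang", "'s", "doctor", "is", "Xiao Li"]
heads = [2, 2, 3, -1, 3]
relations = ["attribute", "particle", "subject", "root", "object"]
triples = build_triples(words, heads, relations)
for triple in triples:
    print(triple)
```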
  • in step S230, the relationship vector of each word is determined based on the word, its dependent word, and the relationship between the word and its dependent word.
  • a first vector can be obtained based on each word, a second vector based on the dependent word of each word, and a third vector based on the relationship between each word and its dependent word; the first vector, the second vector, and the third vector are then combined to obtain the relationship vector corresponding to each word.
  • each word may be encoded to obtain the first sequence, the dependent word of each word may be encoded to obtain the second sequence, and the relationship between each word and its dependent word may be encoded to obtain the third sequence; the first sequence, the second sequence, and the third sequence are truncated or zero-filled, and the truncated or zero-filled first, second, and third sequences are mapped to the first vector, the second vector, and the third vector, respectively.
  • the first sequence, the second sequence, and the third sequence may be truncated from front to back.
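  • the encoding and length normalization in step S230 might look like the sketch below. The id conventions (0 for padding, 1 for unknown tokens), the vocabularies, and the function names are assumptions made for illustration; "truncated from front to back" is read here as keeping the leading entries, which matches the later statement that only the first N words are retained.

```python
def encode(tokens, vocab):
    """Map tokens to integer ids; 0 is reserved for padding, 1 for unknown."""
    return [vocab.get(token, 1) for token in tokens]


def pad_or_truncate(sequence, length, pad_id=0):
    """Keep the first `length` ids, or zero-fill at the end up to `length`."""
    return sequence[:length] + [pad_id] * max(0, length - len(sequence))


# Hypothetical vocabularies for words and relation labels.
word_vocab = {"Xiao Zhang": 2, "'s": 3, "doctor": 4, "is": 5, "Xiao Li": 6}
relation_vocab = {"attribute": 2, "particle": 3, "subject": 4, "root": 5, "object": 6}

words = ["Xiao Zhang", "'s", "doctor", "is", "Xiao Li"]
dependents = ["doctor", "doctor", "is", "is", "is"]
relations = ["attribute", "particle", "subject", "root", "object"]

N = 8  # standard length, chosen arbitrarily for the example
first_seq = pad_or_truncate(encode(words, word_vocab), N)
second_seq = pad_or_truncate(encode(dependents, word_vocab), N)
third_seq = pad_or_truncate(encode(relations, relation_vocab), N)
print(first_seq, second_seq, third_seq)
```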
  • in step S240, the relationships among the relationship vectors of the plurality of words are obtained.
  • the relationship vectors of the multiple words can be input into a pre-trained attention model to obtain the relationships among the relationship vectors of the multiple words.
  • the pre-trained attention model can fully take into account the relationships among the relationship vectors.
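  • the application does not say which attention variant the pre-trained model uses. Purely as an assumption, the sketch below shows scaled dot-product self-attention without learned projections, to illustrate how each relationship vector can be re-expressed as a weighted mix of all the others; the function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Returns the attended vectors and the attention weight matrix, where
    weights[i][j] says how strongly position i attends to position j.
    """
    dim = len(vectors[0])
    weights = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[d] for w, value in zip(row, vectors)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy relationship vectors
outputs, weights = self_attention(vectors)
print(weights)
```

  • a trained model would normally apply learned query, key, and value projections before this computation; they are omitted here to keep the example self-contained.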
  • in step S250, punctuation is added between the multiple words based on the relationships among the relationship vectors.
  • a conditional random field can be used to add punctuation between multiple words.
  • punctuation can be added between the multiple words in multiple candidate ways; features are extracted from the relationships among the relationship vectors through a bidirectional LSTM layer; based on the features, the Viterbi algorithm is used to compute the probability of each candidate way; and punctuation is added between the multiple words according to the candidate with the highest probability.
  • the bidirectional LSTM layer can perform deeper feature extraction on the text to obtain an N*K feature output for the text, where K is the number of neurons in the LSTM layer.
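  • the Viterbi step can be sketched as follows. The sketch assumes that the bidirectional LSTM has already produced a per-position score for every label (`emissions`) and that the conditional random field contributes label-to-label scores (`transitions`); both tables hold made-up numbers, scores are treated as additive (log-domain), and the function name is hypothetical.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label path and its score.

    emissions[t][k]   score of label k at position t
    transitions[j][k] score of moving from label j to label k
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])
    back = []  # back[t-1][k]: best previous label when position t has label k
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(n_labels):
            best_prev = max(
                range(n_labels), key=lambda j: score[j] + transitions[j][k]
            )
            pointers.append(best_prev)
            new_score.append(
                score[best_prev] + transitions[best_prev][k] + emissions[t][k]
            )
        score = new_score
        back.append(pointers)
    last = max(range(n_labels), key=lambda k: score[k])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1], score[last]


emissions = [[1, 0], [0, 1], [1, 0]]  # toy scores for 3 positions, 2 labels
free = [[0, 0], [0, 0]]               # label changes cost nothing
sticky = [[0, -5], [-5, 0]]           # label changes are penalized
print(viterbi(emissions, free))
print(viterbi(emissions, sticky))
```

  • the two calls show how transition scores change the result: with free transitions the path follows the best label at each position, while the penalized version prefers a path that never switches labels.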
  • in this method, the text to be added is segmented to obtain multiple words; the relationships among the words are obtained, giving the dependent word of each word and the relationship between each word and its dependent word; the relationship vector of each word is determined from the word, its dependent word, and the relationship between them; the relationships among the relationship vectors of the multiple words are obtained; and punctuation is added between the words based on those relationships. This takes into account both the relationships between words in the text to be added and the relationship between the words and the text, which can improve the accuracy of punctuation addition to a certain extent.
  • the method of adding punctuation to the text of the present application adds punctuation to the Chinese text that lacks sentence boundary information, supplements necessary sentence structure information, improves the readability of the text, and thereby improves the effect of downstream natural language processing tasks.
  • the present application provides a system for adding punctuation to text.
  • the system for adding punctuation to text applies the method of adding punctuation to the text of this application to process the text to be punctuated.
  • FIG. 3 schematically shows a structural diagram of a system for adding punctuation to text according to an embodiment of the present application.
  • the text punctuation system may include an input module (Input), a syntactic dependency tree module (Dependency tree), a merging module (Concat), an attention module (Attention), a feature extraction module (BiLSTM), a conditional random field module (CRF) and an output module (Output).
  • the process of applying the method of this application to medical information text can include: receiving the medical information text through the input module, where the length of the text in characters can be l1;
  • segmenting the text through the syntactic dependency tree module to obtain a sequence in units of words with length l2, where the sequence length in words is shorter than the text length in characters;
  • extracting the syntactic relationships of the whole sentence, that is, the dependent word of each word and the relationship between each word and its dependent word, and integrating the extracted syntactic relationships to obtain the related dependent words and the corresponding semantic relationships.
  • for example, the input text [Xiao Zhang’s doctor is Xiao Li] has a length of 8 (counted in characters of the original Chinese), and the predicted label of each position is [BESBESBE].
  • word segmentation turns the input into a text sequence of length 5 in units of words, [Xiao Zhang, (possessive particle), doctor, is, Xiao Li], and the syntactic dependency tree then yields a total of 5 triples, one per position: (Xiao Zhang, doctor, 1), (possessive particle, doctor, 2), (doctor, is, 3), (is, is, 4), (Xiao Li, is, 5).
  • regrouping the triples gives the corresponding dependent word sequence [doctor, doctor, is, is, is] and the corresponding semantic relationship sequence [1 2 3 4 5], where each number represents one specific semantic relationship. There are dozens of such relationships, including principal-subordinate relationships, passive relationships, fixed collocations and so on. If the word at the current position is the root of the syntactic dependency tree, its dependent word is the word itself, such as [is] in the example, and the corresponding relationship is marked as [root].
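  • the worked example can be replayed in a few lines. Two assumptions are made here that the application does not state: the per-character labels are read as B (begin of word), E (end of word) and S (single-character word), and the Chinese sentence in the code is a back-translation of the English rendering, chosen because it has exactly 8 characters and 5 words. The function names are hypothetical.

```python
def words_from_tags(text, tags):
    """Cut `text` into words using per-character tags (B, E, S)."""
    words, current = [], ""
    for char, tag in zip(text, tags):
        current += char
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a trailing unfinished word
        words.append(current)
    return words


def split_triples(triples):
    """Regroup (word, dependent, relation) triples into two parallel sequences."""
    dependents = [dependent for _, dependent, _ in triples]
    relations = [relation for _, _, relation in triples]
    return dependents, relations


text = "小张的医生是小李"  # assumed original of "Xiao Zhang's doctor is Xiao Li"
words = words_from_tags(text, "BESBESBE")

triples = [
    ("Xiao Zhang", "doctor", 1),
    ("'s", "doctor", 2),
    ("doctor", "is", 3),
    ("is", "is", 4),  # the root depends on itself
    ("Xiao Li", "is", 5),
]
dependents, relations = split_triples(triples)
print(words, dependents, relations)
```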
  • the process of applying the method of the present application to medical information text may also include: obtaining the semantic vectors according to the semantic relationships (see the steps above for obtaining the first vector, the second vector, and the third vector) and standardizing the sequence length: with the standard length set to N, sequences longer than N are truncated so that only the first N words are retained, and sequences shorter than N are zero-filled, giving three sequences of length N.
  • the first vector, the second vector, and the third vector can be merged together by the merging module.
  • Each word embedding vector has M dimensions, so an N*3M vector will be obtained.
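  • the N*3M shape follows directly from concatenating three M-dimensional embeddings at each of the N positions. The sketch below checks this with toy lookup tables; the table contents, the sizes N = 4 and M = 2, and the function names are hypothetical, and row 0 of each table is an all-zero row standing in for padding.

```python
def make_table(size, m):
    """Toy embedding table: row i is the M-dimensional vector [i, i, ...]."""
    return [[float(i)] * m for i in range(size)]


def embed(ids, table):
    """Look up one M-dimensional row per id."""
    return [table[i] for i in ids]


def concat_rows(first, second, third):
    """Concatenate three N x M matrices position by position into N x 3M."""
    return [a + b + c for a, b, c in zip(first, second, third)]


N, M = 4, 2
word_table = make_table(6, M)      # shared by words and dependent words
relation_table = make_table(5, M)

word_ids = [1, 2, 3, 0]            # already padded to length N
dependent_ids = [3, 3, 4, 0]
relation_ids = [1, 2, 3, 0]

matrix = concat_rows(
    embed(word_ids, word_table),
    embed(dependent_ids, word_table),
    embed(relation_ids, relation_table),
)
print(len(matrix), len(matrix[0]))
```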
  • the process of applying the method of the present application to medical information text may also include: taking the vectors extracted by the neural network as input to the conditional random field, using the Viterbi algorithm to compute the probability of each candidate prediction path, and selecting the path with the largest probability value as the punctuation prediction result.
  • for the text to be predicted, [Xiao Zhang’s doctor is Xiao Li], there are a total of 5 positions that need to be predicted, and there are 3 possibilities for each position.
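  • with 5 positions and 3 possibilities each, the number of candidate prediction paths is 3^5 = 243. The sketch below simply enumerates them to make that count concrete; the three label names are hypothetical, since the application does not list the labels, and the function name is an assumption.

```python
from itertools import product

# Hypothetical label set: the application only says each position has
# 3 possibilities, not what they are.
LABELS = ["none", "comma", "period"]


def candidate_labelings(num_positions, labels=LABELS):
    """Every way of assigning one label to each position."""
    return list(product(labels, repeat=num_positions))


paths = candidate_labelings(5)
print(len(paths))
```

  • the count grows exponentially with sentence length, which is why a dynamic-programming decoder such as the Viterbi algorithm is used instead of scoring every path one by one.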
  • when the method of this application is used to process medical information text, Chinese punctuation prediction based on the syntactic dependency tree and the attention mechanism uses both the feature extraction capability of the LSTM in the neural network and the ability of the conditional random field to model the output sequence. The syntactic dependency tree and the attention mechanism allow the connections between words to be fully considered and the semantic relationship information to be mined as far as possible, while the whole sentence is treated as a unit so that the overall prediction is reasonable. In actual use, the method clearly outperforms existing models.
  • This application can automatically add punctuation marks to the text and supplement necessary sentence structure information, which will greatly improve the effect of subsequent natural language processing tasks.
  • Fig. 4 schematically shows a block diagram of a text punctuation device according to an embodiment of the present application.
  • a text adding punctuation device 400 includes an acquiring module 401, a determining module 402, and an adding module 403.
  • the obtaining module 401 is configured to obtain the text to be added, segment the text to be added to obtain multiple words, and obtain the relationships among the words, giving the dependent word of each word and the relationship between each word and its dependent word; the determining module 402 is configured to determine the relationship vector of each word based on the word, its dependent word, and the relationship between the word and its dependent word; the adding module 403 is configured to obtain the relationships among the relationship vectors of the multiple words and to add punctuation between the multiple words based on those relationships.
  • the obtaining module 401 is configured to: segment the text to be added in text order to obtain the first word segmentation result; segment the text to be added in reverse text order to obtain the second word segmentation result; obtain the difference between the first and second word segmentation results, and segment the part of the text corresponding to the difference from the middle outward to both sides to obtain the difference result; and replace the differing part of the first word segmentation result with the difference result, taking the first word segmentation result after replacement as the multiple words.
  • the acquisition module 401 is configured to: acquire the part of speech and position of each word; and determine the relationship between each word according to the part of speech and position of each word.
  • the determining module 402 is configured to: obtain a first vector based on each word; obtain a second vector based on the dependent word of each word; obtain a third vector based on the relationship between each word and its dependent word; and combine the first vector, the second vector, and the third vector to obtain the relationship vector corresponding to each word.
  • the determining module 402 is configured to: encode each word to obtain the first sequence; encode the dependent word of each word to obtain the second sequence; encode the relationship between each word and its dependent word to obtain the third sequence; truncate or zero-fill the first sequence, the second sequence, and the third sequence; and map the truncated or zero-filled first sequence to the first vector, the truncated or zero-filled second sequence to the second vector, and the truncated or zero-filled third sequence to the third vector.
  • the adding module 403 is configured to input the relationship vectors of multiple words into the pre-trained attention model to obtain the relationship between the relationship vectors of the multiple words.
  • the adding module 403 is configured to: add punctuation between the multiple words in multiple candidate ways; extract features from the relationships among the relationship vectors through the bidirectional LSTM layer; use the Viterbi algorithm, based on the features, to compute the probability of each candidate way; and add punctuation between the multiple words according to the candidate with the highest probability.
  • the electronic device 50 according to this embodiment of the present application will be described below with reference to FIG. 5.
  • the electronic device 50 shown in FIG. 5 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present application.
  • the electronic device 50 is represented in the form of a general-purpose computing device.
  • the components of the electronic device 50 may include, but are not limited to: the aforementioned at least one processing unit 51, the aforementioned at least one storage unit 52, a bus 53 connecting different system components (including the storage unit 52 and the processing unit 51), and a display unit 54.
  • the storage unit stores program code, and the program code can be executed by the processing unit 51, so that the processing unit 51 executes the steps of the various exemplary implementations described in the above-mentioned "Embodiment Method" section of this specification.
  • the storage unit 52 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 521 and/or a cache storage unit 522, and may further include a read-only storage unit (ROM) 523.
  • the storage unit 52 may also include a program/utility tool 524 having a set (at least one) program module 525.
  • the program modules 525 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
  • the bus 53 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of multiple bus structures.
  • the electronic device 50 may also communicate with one or more external devices (such as keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device 50, and/or with any device (such as a router, a modem, etc.) that enables the electronic device 50 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 55.
  • the electronic device 50 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 56. As shown in the figure, the network adapter 56 communicates with other modules of the electronic device 50 through the bus 53.
  • the example embodiments described here can be implemented by software, or by combining software with the necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions that cause a computing device (which can be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present application.
  • a computer-readable storage medium is also provided, on which is stored a program product capable of implementing the above-mentioned method in this specification.
  • each aspect of the present application can also be implemented in the form of a program product that includes program code; when the program product runs on a terminal device, the program code causes the terminal device to execute the steps according to various exemplary embodiments of the present application described in the above-mentioned "Exemplary Method" section of this specification.
  • a program product 60 for implementing the above method according to an embodiment of the present application is described. It can take the form of a portable compact disk read-only memory (CD-ROM) that includes program code, and can run on a terminal device such as a personal computer.
  • the program product of this application is not limited to this.
  • the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, apparatus, or device.
  • the program product may adopt any combination of one or more computer-readable storage media.
  • the computer-readable storage medium may be a readable signal medium or a readable storage medium, and the computer-readable storage medium may be nonvolatile or volatile.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the above.
  • a non-exhaustive list of computer-readable storage media includes: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the foregoing.
  • the program code used to perform the operations of the present application can be written in any combination of one or more programming languages.
  • the programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code can be executed entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • the remote computing device can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, via the Internet using an Internet service provider).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

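A method, apparatus, medium, and electronic device for adding punctuation to text, relating to the field of artificial intelligence. The method includes: obtaining text to be punctuated and segmenting the text into a plurality of words (210); obtaining the relationships among the plurality of words to get the dependency word of each word and the relationship between each word and its dependency word (220); determining a relation vector for each word based on the word, its dependency word, and the relationship between the word and its dependency word (230); obtaining the relationships among the relation vectors of the plurality of words (240); and adding punctuation between the plurality of words based on the relationships among the relation vectors (250). Because the method takes into account both the relationships between words in the text and the relationship between each word and the text as a whole, it can improve the accuracy of punctuation to a certain extent.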

Description

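Method, Apparatus, Medium, and Electronic Device for Adding Punctuation to Text
This application claims priority to Chinese patent application No. 202011344671.1, filed with the China Patent Office on November 25, 2020 and entitled "Method, Apparatus, Medium, and Electronic Device for Adding Punctuation to Text", the entire contents of which are incorporated herein by reference.
Technical Field
This application belongs to the field of artificial intelligence technology and relates in particular to a method, apparatus, medium, and electronic device for adding punctuation to text.
Background
With the continuous development of artificial intelligence, a variety of deep learning techniques have emerged. At present, both the text generated by speech recognition and the corpora collected from social networks are text without any punctuation marks. Lacking the necessary sentence boundary and punctuation information, such text is hard to read, which affects downstream natural language processing tasks such as intent recognition and named entity recognition. The inventors realized that existing punctuation methods require manually constructed features as input and do not take the features of the text itself into account, so the punctuation they add is not accurate enough.
Technical Problem
To address the above technical problem, this application provides a method, apparatus, medium, and electronic device for adding punctuation to text, which can improve the accuracy of punctuation to a certain extent.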
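Technical Solution
According to one aspect of the embodiments of this application, a method for adding punctuation to text is provided, including: obtaining text to be punctuated and segmenting the text into a plurality of words; obtaining the relationships among the plurality of words to get the dependency word of each word and the relationship between each word and its dependency word; determining a relation vector for each word based on the word, its dependency word, and the relationship between the word and its dependency word; obtaining the relationships among the relation vectors of the plurality of words; and adding punctuation between the plurality of words based on the relationships among the relation vectors.
According to one aspect of the embodiments of this application, an apparatus for adding punctuation to text is provided, including: an obtaining module configured to obtain text to be punctuated and segment the text into a plurality of words, and to obtain the relationships among the plurality of words to get the dependency word of each word and the relationship between each word and its dependency word; a determining module configured to determine a relation vector for each word based on the word, its dependency word, and the relationship between the word and its dependency word; and an adding module configured to obtain the relationships among the relation vectors of the plurality of words and to add punctuation between the plurality of words based on the relationships among the relation vectors.
According to one aspect of the embodiments of this application, a computer-readable program medium is provided, which stores computer program instructions that, when executed by a computer, perform the following steps: obtaining text to be punctuated and segmenting the text into a plurality of words; obtaining the relationships among the plurality of words to get the dependency word of each word and the relationship between each word and its dependency word; determining a relation vector for each word based on the word, its dependency word, and the relationship between the word and its dependency word; obtaining the relationships among the relation vectors of the plurality of words; and adding punctuation between the plurality of words based on the relationships among the relation vectors.
According to one aspect of the embodiments of this application, an electronic device is provided, including: a processor; and a memory storing computer-readable instructions, where the processor, when executing the computer-readable instructions, performs the following steps: obtaining text to be punctuated and segmenting the text into a plurality of words; obtaining the relationships among the plurality of words to get the dependency word of each word and the relationship between each word and its dependency word; determining a relation vector for each word based on the word, its dependency word, and the relationship between the word and its dependency word; obtaining the relationships among the relation vectors of the plurality of words; and adding punctuation between the plurality of words based on the relationships among the relation vectors.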
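Beneficial Effects
In the technical solutions provided by some embodiments of this application, text to be punctuated is obtained and segmented into a plurality of words; the relationships among the words are obtained to get the dependency word of each word and the relationship between each word and its dependency word; a relation vector is determined for each word based on the word, its dependency word, and the relationship between them; the relationships among the relation vectors of the words are obtained; and punctuation is added between the words based on the relationships among the relation vectors. Because this takes into account both the relationships between words in the text and the relationship between each word and the text as a whole, it can improve the accuracy of punctuation to a certain extent.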
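Brief Description of the Drawings
FIG. 1 shows a schematic diagram of an exemplary system architecture to which the technical solutions of the embodiments of this application can be applied;
FIG. 2 schematically shows a flowchart of a method for adding punctuation to text according to an embodiment of this application;
FIG. 3 schematically shows the structure of a system for adding punctuation to text according to an embodiment of this application;
FIG. 4 schematically shows a block diagram of an apparatus for adding punctuation to text according to an embodiment of this application;
FIG. 5 is a hardware diagram of an electronic device according to an exemplary embodiment;
FIG. 6 shows a computer-readable storage medium for implementing the method according to an exemplary embodiment.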
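Embodiments of the Invention
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments can be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this application will be thorough and complete and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, many specific details are provided to give a thorough understanding of the embodiments of this application. However, those skilled in the art will recognize that the technical solutions of this application may be practiced without one or more of the specific details, or with other methods, components, apparatuses, steps, and so on. In other cases, well-known methods, apparatuses, implementations, or operations are not shown or described in detail to avoid obscuring aspects of this application.
The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flowcharts shown in the drawings are merely illustrative; they need not include all of the content and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed and others may be merged or partially merged, so the actual order of execution may change according to the actual situation.
FIG. 1 shows a schematic diagram of an exemplary system architecture 100 to which the technical solutions of the embodiments of this application can be applied.
As shown in FIG. 1, the system architecture 100 may include a terminal device 101 (which may be one or more of a smartphone, a tablet computer, and a portable computer, or may be a desktop computer, and so on), a network 102, and a server 103. The network 102 is the medium that provides a communication link between the terminal device 101 and the server 103. The network 102 may include various connection types, such as wired communication links and wireless communication links.
It should be understood that the numbers of terminal devices 101, networks 102, and servers 103 in FIG. 1 are merely illustrative. There may be any number of terminal devices 101, networks 102, and servers 103 according to implementation needs. For example, the server 103 may be a server cluster composed of multiple servers.
In one embodiment of this application, the server 103 obtains text to be punctuated and segments the text into a plurality of words; obtains the relationships among the words to get the dependency word of each word and the relationship between each word and its dependency word; determines a relation vector for each word based on the word, its dependency word, and the relationship between them; obtains the relationships among the relation vectors of the words; and adds punctuation between the words based on the relationships among the relation vectors. Because this takes into account both the relationships between words in the text and the relationship between each word and the text as a whole, it can improve the accuracy of punctuation to a certain extent.
It should be noted that the method for adding punctuation to text provided by the embodiments of this application is generally executed by the server 103, and accordingly the apparatus for adding punctuation to text is generally deployed in the server 103. However, in other embodiments of this application, the terminal device 101 may have functions similar to those of the server 103 and thus execute the method provided by the embodiments of this application.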
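The implementation details of the technical solutions of the embodiments of this application are described in detail below:
FIG. 2 schematically shows a flowchart of a method for adding punctuation to text according to an embodiment of this application. The method may be executed by a server, for example the server 103 shown in FIG. 1.
Referring to FIG. 2, the method for adding punctuation to text includes at least steps S210 to S250, described in detail as follows:
In step S210, text to be punctuated is obtained and segmented into a plurality of words.
In one embodiment of this application, the text to be punctuated may be segmented in forward text order to obtain a first segmentation result and segmented in reverse text order to obtain a second segmentation result. The differences between the first and second segmentation results are then obtained, and the portion of the text corresponding to each difference is segmented from the middle outward to both sides to obtain a difference result. The portions of the first segmentation result that differ from the second segmentation result are replaced with the difference result, and the first segmentation result after replacement is used as the plurality of words.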
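The forward and reverse passes described above can be sketched as follows. This is a minimal illustration that assumes dictionary-based maximum matching as the segmenter; the source does not name a specific algorithm, and the vocabulary, the `max_len` parameter, and the second example sentence are hypothetical. The sketch only locates the spans where the two passes disagree; the middle-outward re-segmentation of those spans is not implemented here.

```python
def forward_max_match(text, vocab, max_len=4):
    """Segment left to right, always taking the longest dictionary match."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # fall back to a single character
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len=4):
    """Segment right to left, always taking the longest dictionary match."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.insert(0, piece)
                j -= size
                break
    return words


def spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    result, pos = set(), 0
    for word in words:
        result.add((pos, pos + len(word)))
        pos += len(word)
    return result


# Hypothetical vocabulary for the sentence used later in this description.
VOCAB = {"小张", "医生", "小李"}
fwd = forward_max_match("小张的医生是小李", VOCAB)
bwd = backward_max_match("小张的医生是小李", VOCAB)
disputed = spans(fwd) ^ spans(bwd)  # spans present in only one result

# A hypothetical sentence on which the two passes disagree.
VOCAB2 = {"研究", "研究生", "生命", "起源"}
fwd2 = forward_max_match("研究生命起源", VOCAB2)
bwd2 = backward_max_match("研究生命起源", VOCAB2)
disputed2 = spans(fwd2) ^ spans(bwd2)
```

For the first sentence both passes agree, so `disputed` is empty and the first result can be used directly. For the second, the disputed spans cover the first four characters, which is the region that would be re-segmented.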
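In one embodiment of this application, meaningless characters in the text to be punctuated may be filtered out before segmentation.
In one embodiment of this application, each character in the text to be punctuated may be identified, combined with its neighboring characters, and checked against a preset vocabulary; combinations found in the preset vocabulary are segmented out as words.
In one embodiment of this application, the meaning of each character may be obtained; if the meanings of adjacent characters can be combined, the character and its adjacent character are treated as one word.
In one embodiment of this application, the text to be punctuated may be input into a pre-trained word segmentation model to obtain the plurality of words output by the model.
In step S220, the relationships among the plurality of words are obtained to get the dependency word of each word and the relationship between each word and its dependency word.
In one embodiment of this application, the part of speech and position of each word may be obtained, and the relationships among the words determined according to the part of speech and position of each word.
In one embodiment of this application, the meaning of each word may be obtained, and the relationships among the plurality of words determined according to those meanings.
In one embodiment of this application, for each of the plurality of words, a word relationship table may be looked up using that word paired with any other word in the plurality of words, to obtain the dependency word associated with that word and the relationship between the word and its dependency word.
In one embodiment of this application, the plurality of words may be input into a pre-trained relationship acquisition model to obtain the relationships among the words output by the model.
In one embodiment of this application, the relationship acquisition model may be a dependency syntax tree model.
In one embodiment of this application, the dependency relations may include: principal-subordinate relations, verb-object relations, passive relations, subordination relations, fixed collocations, appositives, adjectives, and so on.
In one embodiment of this application, a label may be assigned to each type of dependency relation so that vectors can later be generated from the labels.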
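The table-lookup variant and the per-relation labels might look like the sketch below. The contents of `RELATION_TABLE` and the relation names (`modifier`, `particle`, `subject`, `object`) are hypothetical placeholders rather than the relation inventory of the source; label id 0 is left unused here on the assumption that it is reserved for padding.

```python
# Hypothetical relationship table: (word, candidate dependency word) -> relation.
RELATION_TABLE = {
    ("小张", "医生"): "modifier",
    ("的", "医生"): "particle",
    ("医生", "是"): "subject",
    ("小李", "是"): "object",
}


def find_dependency(word, words, table):
    """Look the word up against every word in the sentence.

    Returns (dependency_word, relation). A word with no entry is treated
    as the root and depends on itself.
    """
    for other in words:
        relation = table.get((word, other))
        if relation is not None:
            return other, relation
    return word, "root"


def assign_labels(relations):
    """Give each relation type an integer label, starting at 1."""
    labels = {}
    for relation in relations:
        labels.setdefault(relation, len(labels) + 1)
    return labels


words = ["小张", "的", "医生", "是", "小李"]
triples = [(w, *find_dependency(w, words, RELATION_TABLE)) for w in words]
labels = assign_labels(rel for _, _, rel in triples)
label_sequence = [labels[rel] for _, _, rel in triples]
```

Each triple pairs a word with its dependency word and relation, and `label_sequence` is the integer form that a later embedding step can consume.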
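In step S230, a relation vector is determined for each word based on the word, its dependency word, and the relationship between the word and its dependency word.
In one embodiment of this application, a first vector may be obtained from the words, a second vector from the dependency words of the words, and a third vector from the relationships between each word and its dependency word; the first, second, and third vectors are then combined to obtain the relation vector corresponding to each word.
In one embodiment of this application, the words may be encoded to obtain a first sequence, the dependency words of the words encoded to obtain a second sequence, and the relationships between each word and its dependency word encoded to obtain a third sequence. The first, second, and third sequences are truncated or zero-padded; the first sequence after truncation or padding is mapped to the first vector, the second to the second vector, and the third to the third vector.
In one embodiment of this application, the first, second, and third sequences may be truncated from front to back, so that the leading part of each sequence is kept.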
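As a minimal sketch of the encoding and length normalization, the code below turns the three inputs into integer sequences of a fixed length. The id assignment scheme, the choice of 0 as the padding id, and the value of `N` are assumptions made for illustration; the source only specifies truncation from the front and zero-padding.

```python
PAD_ID = 0  # assumption: id 0 is reserved for padding


def build_index(items):
    """Assign integer ids starting at 1, in order of first appearance."""
    index = {}
    for item in items:
        index.setdefault(item, len(index) + 1)
    return index


def normalize(ids, length):
    """Keep the first `length` ids, then zero-pad the tail if too short."""
    ids = ids[:length]
    return ids + [PAD_ID] * (length - len(ids))


words = ["小张", "的", "医生", "是", "小李"]
parents = ["医生", "医生", "是", "是", "是"]   # dependency word of each word
relations = [1, 2, 3, 4, 5]                  # relation label of each word

word_index = build_index(words + parents)
N = 8  # hypothetical standard length

first_seq = normalize([word_index[w] for w in words], N)
second_seq = normalize([word_index[p] for p in parents], N)
third_seq = normalize(relations, N)

truncated = normalize([word_index[w] for w in words], 3)
```

With `N = 8` the five-word sentence is padded with three zeros; with a length of 3 only the first three words survive.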
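In step S240, the relationships among the relation vectors of the plurality of words are obtained.
In one embodiment of this application, the relation vectors of the plurality of words may be input into a pre-trained attention model to obtain the relationships among the relation vectors; the pre-trained attention model can fully take into account the relationship between every pair of relation vectors.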
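The source does not specify the form of the attention model. As one plausible reading, the sketch below uses scaled dot-product self-attention without learned projection matrices, written in plain Python so that it runs without dependencies; a trained model would normally add learned query, key, and value projections. The input vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values.

    Returns (weights, outputs): weights[i][j] says how strongly position i
    attends to position j, and outputs[i] is the weighted mix of all vectors.
    """
    dim = len(vectors[0])
    weights = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * vec[d] for w, vec in zip(row, vectors)) for d in range(dim)]
        for row in weights
    ]
    return weights, outputs


relation_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical inputs
weights, outputs = self_attention(relation_vectors)
```

Every output row mixes information from all positions, which is the sense in which the relationship between each pair of relation vectors is taken into account.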
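In step S250, punctuation is added between the plurality of words based on the relationships among the relation vectors.
In one embodiment of this application, punctuation may be added between the plurality of words by means of a conditional random field.
In one embodiment of this application, punctuation may be placed between the plurality of words to obtain multiple candidate placements; features are extracted from the relationships among the relation vectors by a bidirectional LSTM layer; based on the features, the Viterbi algorithm is used to compute the probability of each candidate placement, and punctuation is added between the words according to the placement with the highest probability.
In this embodiment, the bidirectional LSTM layer performs deeper feature extraction on the text and produces a feature output of size N*K, where K is the number of neurons in the LSTM layer.
For example, suppose that in a certain scenario there are three types of punctuation to predict: no punctuation, comma, and period. For the text 【小张的医生是小李】, a total of 5 positions need to be predicted and each position has 3 possibilities, giving 3*3*3*3*3 = 243 possible prediction results. If 【no punctuation, no punctuation, no punctuation, no punctuation, period】 has the highest probability among them, the final prediction result is 【小张的医生是小李。】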
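The selection of the highest-scoring placement can be illustrated with a small Viterbi decoder. The emission and transition scores below are made-up numbers standing in for what the BiLSTM and CRF layers would produce, and the label names are hypothetical. A brute-force search over all 3^5 = 243 paths is included only to check the decoder on this toy input.

```python
from itertools import product

LABELS = ["none", "comma", "period"]
MARKS = {"none": "", "comma": ",", "period": "。"}


def viterbi(emissions, transitions):
    """Return (best_score, best_labels) for additive (log-domain) scores."""
    n = len(LABELS)
    score = list(emissions[0])
    back = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(n):
            prev = max(range(n), key=lambda j: score[j] + transitions[j][k])
            pointers.append(prev)
            new_score.append(score[prev] + transitions[prev][k] + emissions[t][k])
        score = new_score
        back.append(pointers)
    last = max(range(n), key=lambda k: score[k])
    path = [last]
    for pointers in reversed(back):  # follow back-pointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], [LABELS[k] for k in path]


def brute_force(emissions, transitions):
    """Score every possible path; only feasible for tiny inputs."""
    best = None
    for path in product(range(len(LABELS)), repeat=len(emissions)):
        total = emissions[0][path[0]]
        for t in range(1, len(path)):
            total += transitions[path[t - 1]][path[t]] + emissions[t][path[t]]
        if best is None or total > best[0]:
            best = (total, path)
    return best[0], [LABELS[k] for k in best[1]]


def apply_labels(words, labels):
    """Append the predicted mark (if any) after each word."""
    return "".join(word + MARKS[label] for word, label in zip(words, labels))


# Hypothetical scores: rows are positions, columns follow LABELS.
emissions = [[1.0, 0.0, 0.0]] * 4 + [[0.0, 0.2, 1.5]]
transitions = [[0.0, 0.0, 0.0], [0.0, -2.0, -2.0], [0.0, -2.0, -2.0]]

words = ["小张", "的", "医生", "是", "小李"]
best_score, best_labels = viterbi(emissions, transitions)
punctuated = apply_labels(words, best_labels)
```

Viterbi reaches the same answer as exhaustive enumeration while only examining each position and label pair once, which matters when the sequence is longer than a few words.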
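In the embodiment of FIG. 2, text to be punctuated is obtained and segmented into a plurality of words; the relationships among the words are obtained to get the dependency word of each word and the relationship between each word and its dependency word; a relation vector is determined for each word based on the word, its dependency word, and the relationship between them; the relationships among the relation vectors of the words are obtained; and punctuation is added between the words based on the relationships among the relation vectors. Because this takes into account both the relationships between words in the text and the relationship between each word and the text as a whole, it can improve the accuracy of punctuation to a certain extent.
By adding punctuation marks to Chinese text that lacks sentence boundary information, the method of this application supplies the necessary sentence structure information, improves the readability of the text, and in turn improves the results of downstream natural language processing tasks.
In one embodiment of this application, a system for adding punctuation to text is provided. The system applies the method of this application to process text to be punctuated, such as medical information text. FIG. 3 schematically shows the structure of the system according to an embodiment of this application. As shown in FIG. 3, the system may include an input module (Input), a dependency syntax tree module (Dependency tree), a merging module (concat), an attention module (Attention), a feature extraction module (BiLSTM), a conditional random field module (CRF), and an output module (Output).
The process of applying the method of this application to medical information text may include the following. The input module receives the medical information text, whose length may be l1. The text is segmented by the dependency syntax tree module, yielding a word-level sequence of length l2; the word-level sequence is shorter than the character-level text. The syntactic relations of the whole sentence are then extracted, giving the dependency word of each word and the relationship between each word and its dependency word. By consolidating the extracted syntactic relations, the related dependency word and the corresponding semantic relation of every word can be obtained.
For example, 【小张的医生是小李】 has length 8. After segmentation, the predicted tags for the positions are 【B E S B E S B E】 (B, E, and S conventionally mark the beginning of a word, the end of a word, and a single-character word). Consolidating the tags gives the word-level sequence 【小张 的 医生 是 小李】 of length 5. Passing it through the dependency syntax tree then gives one triple per position, 5 in total: (小张, 医生, 1), (的, 医生, 2), (医生, 是, 3), (是, 是, 4), (小李, 是, 5). Consolidating these triples gives the corresponding dependency word sequence 【医生 医生 是 是 是】 and the semantic relation sequence 【1 2 3 4 5】, where each number represents one type of semantic relation. There are dozens of specific semantic relations in total, including principal-subordinate relations, passive relations, fixed collocations, and so on. If the word at the current position is the root of the dependency syntax tree, its related word is the word itself, as with 【是】 in the example, and the corresponding relation is separately marked as 【root】.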
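The consolidation steps in this example can be reproduced with a few lines of code. The helper names are hypothetical, and the triples are copied from the example rather than produced by a real parser.

```python
def tags_to_words(chars, tags):
    """Merge characters into words; a word closes on an E or S tag."""
    words, current = [], ""
    for char, tag in zip(chars, tags):
        current += char
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


def split_triples(triples):
    """Split (word, dependency word, relation) triples into two sequences."""
    parents = [parent for _, parent, _ in triples]
    relations = [relation for _, _, relation in triples]
    return parents, relations


text = "小张的医生是小李"
tags = "B E S B E S B E".split()
words = tags_to_words(text, tags)

# Triples as listed in the example above.
triples = [
    ("小张", "医生", 1),
    ("的", "医生", 2),
    ("医生", "是", 3),
    ("是", "是", 4),
    ("小李", "是", 5),
]
parents, relations = split_triples(triples)
roots = [word for word, parent, _ in triples if word == parent]
```

The root is recognizable as the only word whose dependency word is itself.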
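In one embodiment of this application, the process of applying the method of this application to medical information text may further include: obtaining semantic vectors from the semantic relations (see the steps above for obtaining the first, second, and third vectors) and normalizing their length. The standard length is set to N: sequences longer than N are truncated so that only the first N tokens are kept, and sequences shorter than N are zero-padded. This yields three sequences of length N, from which are obtained the first vector based on the words (Word Emb), the second vector based on the dependency words of the words (Parent Emb), and the third vector based on the relationships between each word and its dependency word (Relation Emb).
In one embodiment of this application, the first, second, and third vectors may be merged by the merging module. Each embedding vector has M dimensions, so the merged result has size N*3M.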
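The N*3M shape follows from concatenating three M-dimensional embeddings per position. The sketch below uses randomly initialized lookup tables as stand-ins for trained embeddings; the table sizes, the seeds, and the values of `N` and `M` are assumptions, and row 0 of each table is zeroed on the assumption that id 0 is padding.

```python
import random


def make_table(num_ids, dim, seed):
    """Random embedding table; row 0 (padding) is all zeros."""
    rng = random.Random(seed)
    table = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_ids)]
    table[0] = [0.0] * dim
    return table


def embed(ids, table):
    """Look up one embedding row per id."""
    return [table[i] for i in ids]


def concat(word_emb, parent_emb, relation_emb):
    """Concatenate the three M-dim vectors at each position into one 3M-dim vector."""
    return [w + p + r for w, p, r in zip(word_emb, parent_emb, relation_emb)]


N, M = 8, 4  # hypothetical standard length and embedding size
first_seq = [1, 2, 3, 4, 5, 0, 0, 0]    # word ids, zero-padded
second_seq = [3, 3, 4, 4, 4, 0, 0, 0]   # dependency word ids
third_seq = [1, 2, 3, 4, 5, 0, 0, 0]    # relation labels

word_table = make_table(6, M, seed=0)
relation_table = make_table(6, M, seed=1)

relation_vectors = concat(
    embed(first_seq, word_table),      # Word Emb
    embed(second_seq, word_table),     # Parent Emb (shares the word table here)
    embed(third_seq, relation_table),  # Relation Emb
)
```

Sharing one table between words and dependency words is a design choice made here for brevity; separate tables would work equally well and give the same output shape.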
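In one embodiment of this application, the process of applying the method of this application to medical information text may further include: using a conditional random field that takes the vectors extracted by the neural network as input, applying the Viterbi algorithm to compute the probability of each prediction path, and selecting the path with the highest probability as the punctuation prediction result. Suppose that in a certain scenario there are three types of punctuation to predict: no punctuation, comma, and period. For the text 【小张的医生是小李】, a total of 5 positions need to be predicted and each position has 3 possibilities, giving 3*3*3*3*3 = 243 possible prediction results. If 【no punctuation, no punctuation, no punctuation, no punctuation, period】 has the highest probability among them, the final prediction result is 【小张的医生是小李。】.
When the method of this application is applied to medical information text, Chinese punctuation prediction is based on a dependency syntax tree and an attention mechanism. It uses both the feature extraction ability of the LSTM in the neural network and the ability of the conditional random field to model the output sequence, and it also makes use of the dependency syntax tree and the attention mechanism. It can therefore fully consider the connections between words, mine the semantic relation information as far as possible, treat the whole sentence as a unit, and take the plausibility of the overall prediction into account. In practical use its results are clearly better than those of existing models. This application can automatically add punctuation marks to text and supply the necessary sentence structure information, which greatly improves the results of subsequent natural language processing tasks.
The apparatus embodiments of this application are described below; they can be used to execute the method for adding punctuation to text in the above embodiments of this application. For details not disclosed in the apparatus embodiments, refer to the above embodiments of the method for adding punctuation to text.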
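FIG. 4 schematically shows a block diagram of an apparatus for adding punctuation to text according to an embodiment of this application.
Referring to FIG. 4, an apparatus 400 for adding punctuation to text according to an embodiment of this application includes an obtaining module 401, a determining module 402, and an adding module 403.
In some embodiments of this application, based on the foregoing solution, the obtaining module 401 is configured to obtain text to be punctuated and segment the text into a plurality of words, and to obtain the relationships among the words to get the dependency word of each word and the relationship between each word and its dependency word; the determining module 402 is configured to determine a relation vector for each word based on the word, its dependency word, and the relationship between the word and its dependency word; and the adding module 403 is configured to obtain the relationships among the relation vectors of the words and to add punctuation between the words based on the relationships among the relation vectors.
In some embodiments of this application, based on the foregoing solution, the obtaining module 401 is configured to: segment the text to be punctuated in forward text order to obtain a first segmentation result; segment the text in reverse text order to obtain a second segmentation result; obtain the differences between the first and second segmentation results, and segment the portion of the text corresponding to the differences from the middle outward to both sides to obtain a difference result; and replace the portions of the first segmentation result that differ from the second segmentation result with the difference result, using the first segmentation result after replacement as the plurality of words.
In some embodiments of this application, based on the foregoing solution, the obtaining module 401 is configured to: obtain the part of speech and position of each word; and determine the relationships among the words according to the part of speech and position of each word.
In some embodiments of this application, based on the foregoing solution, the determining module 402 is configured to: obtain a first vector based on the words; obtain a second vector based on the dependency words of the words; obtain a third vector based on the relationships between each word and its dependency word; and combine the first, second, and third vectors to obtain the relation vector corresponding to each word.
In some embodiments of this application, based on the foregoing solution, the determining module 402 is configured to: encode the words to obtain a first sequence; encode the dependency words of the words to obtain a second sequence; encode the relationships between each word and its dependency word to obtain a third sequence; truncate or zero-pad the first, second, and third sequences; and map the first sequence after truncation or padding to the first vector, the second to the second vector, and the third to the third vector.
In some embodiments of this application, based on the foregoing solution, the adding module 403 is configured to: input the relation vectors of the plurality of words into a pre-trained attention model to obtain the relationships among the relation vectors of the words.
In some embodiments of this application, based on the foregoing solution, the adding module 403 is configured to: place punctuation between the plurality of words to obtain multiple candidate placements; extract features from the relationships among the relation vectors through a bidirectional LSTM layer; and, based on the features, use the Viterbi algorithm to compute the probability of each candidate placement and add punctuation between the words according to the placement with the highest probability.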
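Those skilled in the art will understand that the various aspects of this application can be implemented as a system, a method, or a program product. Therefore, the various aspects of this application can take the following forms: an entirely hardware implementation, an entirely software implementation (including firmware, microcode, and so on), or an implementation combining hardware and software aspects, which may be collectively referred to herein as a "circuit", "module", or "system".
An electronic device 50 according to this implementation of the application is described below with reference to FIG. 5. The electronic device 50 shown in FIG. 5 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of this application.
As shown in FIG. 5, the electronic device 50 takes the form of a general-purpose computing device. The components of the electronic device 50 may include, but are not limited to: at least one processing unit 51, at least one storage unit 52, a bus 53 connecting the different system components (including the storage unit 52 and the processing unit 51), and a display unit 54.
The storage unit stores program code that can be executed by the processing unit 51, so that the processing unit 51 performs the steps according to the various exemplary implementations of this application described in the "embodiment method" section of this specification.
The storage unit 52 may include readable media in the form of volatile storage units, such as a random access memory (RAM) 521 and/or a cache storage unit 522, and may further include a read-only memory (ROM) 523.
The storage unit 52 may also include a program/utility 524 having a set of (at least one) program modules 525. Such program modules 525 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
The bus 53 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
The electronic device 50 may also communicate with one or more external devices (such as a keyboard, a pointing device, or a Bluetooth device), with one or more devices that enable a user to interact with the electronic device 50, and/or with any device (such as a router or a modem) that enables the electronic device 50 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 55. In addition, the electronic device 50 may communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 56. As shown in the figure, the network adapter 56 communicates with the other modules of the electronic device 50 through the bus 53. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 50, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.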
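From the description of the above implementations, those skilled in the art will readily understand that the example implementations described here can be implemented in software, or in software combined with the necessary hardware. Therefore, the technical solutions according to the implementations of this application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and which includes several instructions that cause a computing device (such as a personal computer, a server, a terminal apparatus, or a network device) to execute the method according to the implementations of this application.
According to an embodiment of this application, a computer-readable storage medium is also provided, on which is stored a program product capable of implementing the above method of this specification. In some possible implementations, the various aspects of this application can also be implemented in the form of a program product that includes program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps according to the various exemplary implementations of this application described in the "exemplary method" section of this specification.
Referring to FIG. 6, a program product 60 for implementing the above method according to an implementation of this application is described. It may take the form of a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of this application is not limited to this. In this document, a readable storage medium may be any tangible medium that contains or stores a program, where the program can be used by or in combination with an instruction execution system, apparatus, or device.
The program product may use any combination of one or more computer-readable storage media. A computer-readable storage medium may be a readable signal medium or a readable storage medium, and it may be non-volatile or volatile. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which readable program code is carried. Such a propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
The program code contained on a readable medium can be transmitted over any suitable medium, including but not limited to wireless, wired, optical cable, RF, and so on, or any suitable combination of the above.
The program code for performing the operations of this application can be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, the remote computing device can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, via the Internet using an Internet service provider).
In addition, the above drawings are merely schematic illustrations of the processing included in the method according to the exemplary embodiments of this application and are not intended to be limiting. It is easy to understand that the processing shown in the drawings does not indicate or limit the chronological order of that processing. It is also easy to understand that the processing may be executed synchronously or asynchronously, for example in multiple modules.
It should be understood that this application is not limited to the precise structure described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.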

Claims (20)

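  1. A method for adding punctuation to text, comprising:
    obtaining text to be punctuated and segmenting the text to be punctuated into a plurality of words;
    obtaining the relationships among the plurality of words to get the dependency word of each word and the relationship between each word and its dependency word;
    determining a relation vector for each word based on the word, the dependency word of the word, and the relationship between the word and its dependency word;
    obtaining the relationships among the relation vectors of the plurality of words;
    adding punctuation between the plurality of words based on the relationships among the relation vectors.
  2. The method for adding punctuation to text according to claim 1, wherein segmenting the text to be punctuated into a plurality of words comprises:
    segmenting the text to be punctuated in forward text order to obtain a first segmentation result;
    segmenting the text to be punctuated in reverse text order to obtain a second segmentation result;
    obtaining the differences between the first segmentation result and the second segmentation result, and segmenting the portion of the text to be punctuated corresponding to the differences from the middle outward to both sides to obtain a difference result;
    replacing the portions of the first segmentation result that differ from the second segmentation result with the difference result, and using the first segmentation result after replacement as the plurality of words.
  3. The method for adding punctuation to text according to claim 1, wherein obtaining the relationships among the plurality of words comprises:
    obtaining the part of speech and position of each word;
    determining the relationships among the words according to the part of speech and position of each word.
  4. The method for adding punctuation to text according to claim 1, wherein determining the relation vector corresponding to each word based on the word, the dependency word of the word, and the relationship between the word and its dependency word comprises:
    obtaining a first vector based on the words;
    obtaining a second vector based on the dependency words of the words;
    obtaining a third vector based on the relationships between each word and its dependency word;
    combining the first vector, the second vector, and the third vector to obtain the relation vector corresponding to each word.
  5. The method for adding punctuation to text according to claim 4, wherein obtaining the first vector based on the words, obtaining the second vector based on the dependency words of the words, and obtaining the third vector based on the relationships between each word and its dependency word comprises:
    encoding the words to obtain a first sequence;
    encoding the dependency words of the words to obtain a second sequence;
    encoding the relationships between each word and its dependency word to obtain a third sequence;
    truncating or zero-padding the first sequence, the second sequence, and the third sequence, mapping the first sequence after truncation or zero-padding to the first vector, mapping the second sequence after truncation or zero-padding to the second vector, and mapping the third sequence after truncation or zero-padding to the third vector.
  6. The method for adding punctuation to text according to claim 1, wherein obtaining the relationships among the relation vectors of the plurality of words comprises:
    inputting the relation vectors of the plurality of words into a pre-trained attention model to obtain the relationships among the relation vectors of the plurality of words.
  7. The method for adding punctuation to text according to claim 1, wherein adding punctuation between the plurality of words based on the relationships among the relation vectors comprises:
    placing punctuation between the plurality of words to obtain multiple candidate placements;
    extracting features from the relationships among the relation vectors through a bidirectional LSTM layer;
    based on the features, using the Viterbi algorithm to compute the probability of each candidate placement, and adding punctuation between the plurality of words according to the placement with the highest probability among the multiple candidate placements.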
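  8. An apparatus for adding punctuation to text, comprising:
    an obtaining module configured to obtain text to be punctuated and segment the text to be punctuated into a plurality of words, and to obtain the relationships among the plurality of words to get the dependency word of each word and the relationship between each word and its dependency word;
    a determining module configured to determine a relation vector for each word based on the word, the dependency word of the word, and the relationship between the word and its dependency word;
    an adding module configured to obtain the relationships among the relation vectors of the plurality of words and to add punctuation between the plurality of words based on the relationships among the relation vectors.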
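  9. An electronic device, wherein the electronic device comprises a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and the processor, when executing the computer-readable instructions, performs the following steps:
    obtaining text to be punctuated and segmenting the text to be punctuated into a plurality of words;
    obtaining the relationships among the plurality of words to get the dependency word of each word and the relationship between each word and its dependency word;
    determining a relation vector for each word based on the word, the dependency word of the word, and the relationship between the word and its dependency word;
    obtaining the relationships among the relation vectors of the plurality of words;
    adding punctuation between the plurality of words based on the relationships among the relation vectors.
  10. The electronic device according to claim 9, wherein segmenting the text to be punctuated into a plurality of words comprises:
    segmenting the text to be punctuated in forward text order to obtain a first segmentation result;
    segmenting the text to be punctuated in reverse text order to obtain a second segmentation result;
    obtaining the differences between the first segmentation result and the second segmentation result, and segmenting the portion of the text to be punctuated corresponding to the differences from the middle outward to both sides to obtain a difference result;
    replacing the portions of the first segmentation result that differ from the second segmentation result with the difference result, and using the first segmentation result after replacement as the plurality of words.
  11. The electronic device according to claim 9, wherein obtaining the relationships among the plurality of words comprises:
    obtaining the part of speech and position of each word;
    determining the relationships among the words according to the part of speech and position of each word.
  12. The electronic device according to claim 9, wherein determining the relation vector corresponding to each word based on the word, the dependency word of the word, and the relationship between the word and its dependency word comprises:
    obtaining a first vector based on the words;
    obtaining a second vector based on the dependency words of the words;
    obtaining a third vector based on the relationships between each word and its dependency word;
    combining the first vector, the second vector, and the third vector to obtain the relation vector corresponding to each word.
  13. The electronic device according to claim 12, wherein obtaining the first vector based on the words, obtaining the second vector based on the dependency words of the words, and obtaining the third vector based on the relationships between each word and its dependency word comprises:
    encoding the words to obtain a first sequence;
    encoding the dependency words of the words to obtain a second sequence;
    encoding the relationships between each word and its dependency word to obtain a third sequence;
    truncating or zero-padding the first sequence, the second sequence, and the third sequence, mapping the first sequence after truncation or zero-padding to the first vector, mapping the second sequence after truncation or zero-padding to the second vector, and mapping the third sequence after truncation or zero-padding to the third vector.
  14. The electronic device according to claim 9, wherein obtaining the relationships among the relation vectors of the plurality of words comprises:
    inputting the relation vectors of the plurality of words into a pre-trained attention model to obtain the relationships among the relation vectors of the plurality of words.
  15. The electronic device according to claim 9, wherein adding punctuation between the plurality of words based on the relationships among the relation vectors comprises:
    placing punctuation between the plurality of words to obtain multiple candidate placements;
    extracting features from the relationships among the relation vectors through a bidirectional LSTM layer;
    based on the features, using the Viterbi algorithm to compute the probability of each candidate placement, and adding punctuation between the plurality of words according to the placement with the highest probability among the multiple candidate placements.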
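  16. A computer-readable storage medium, wherein the computer-readable storage medium stores at least one instruction that, when executed by a processor, performs the following steps:
    obtaining text to be punctuated and segmenting the text to be punctuated into a plurality of words;
    obtaining the relationships among the plurality of words to get the dependency word of each word and the relationship between each word and its dependency word;
    determining a relation vector for each word based on the word, the dependency word of the word, and the relationship between the word and its dependency word;
    obtaining the relationships among the relation vectors of the plurality of words;
    adding punctuation between the plurality of words based on the relationships among the relation vectors.
  17. The computer-readable storage medium according to claim 16, wherein segmenting the text to be punctuated into a plurality of words comprises:
    segmenting the text to be punctuated in forward text order to obtain a first segmentation result;
    segmenting the text to be punctuated in reverse text order to obtain a second segmentation result;
    obtaining the differences between the first segmentation result and the second segmentation result, and segmenting the portion of the text to be punctuated corresponding to the differences from the middle outward to both sides to obtain a difference result;
    replacing the portions of the first segmentation result that differ from the second segmentation result with the difference result, and using the first segmentation result after replacement as the plurality of words.
  18. The computer-readable storage medium according to claim 16, wherein obtaining the relationships among the plurality of words comprises:
    obtaining the part of speech and position of each word;
    determining the relationships among the words according to the part of speech and position of each word.
  19. The computer-readable storage medium according to claim 16, wherein determining the relation vector corresponding to each word based on the word, the dependency word of the word, and the relationship between the word and its dependency word comprises:
    obtaining a first vector based on the words;
    obtaining a second vector based on the dependency words of the words;
    obtaining a third vector based on the relationships between each word and its dependency word;
    combining the first vector, the second vector, and the third vector to obtain the relation vector corresponding to each word.
  20. The computer-readable storage medium according to claim 19, wherein obtaining the first vector based on the words, obtaining the second vector based on the dependency words of the words, and obtaining the third vector based on the relationships between each word and its dependency word comprises:
    encoding the words to obtain a first sequence;
    encoding the dependency words of the words to obtain a second sequence;
    encoding the relationships between each word and its dependency word to obtain a third sequence;
    truncating or zero-padding the first sequence, the second sequence, and the third sequence, mapping the first sequence after truncation or zero-padding to the first vector, mapping the second sequence after truncation or zero-padding to the second vector, and mapping the third sequence after truncation or zero-padding to the third vector.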
PCT/CN2021/084169 2020-11-25 2021-03-30 文本添加标点的方法、装置、介质及电子设备 WO2021213155A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011344671.1A CN112464642A (zh) 2020-11-25 2020-11-25 文本添加标点的方法、装置、介质及电子设备
CN202011344671.1 2020-11-25

Publications (1)

Publication Number Publication Date
WO2021213155A1 true WO2021213155A1 (zh) 2021-10-28

Family

ID=74807954

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084169 WO2021213155A1 (zh) 2020-11-25 2021-03-30 文本添加标点的方法、装置、介质及电子设备

Country Status (2)

Country Link
CN (1) CN112464642A (zh)
WO (1) WO2021213155A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629237A (zh) * 2023-07-25 2023-08-22 江西财经大学 基于逐步集成多层注意力的事件表示学习方法及系统

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464642A (zh) * 2020-11-25 2021-03-09 平安科技(深圳)有限公司 文本添加标点的方法、装置、介质及电子设备
CN117113941B (zh) * 2023-10-23 2024-02-06 新声科技(深圳)有限公司 标点符号恢复方法、装置、电子设备及存储介质

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107153687A (zh) * 2017-04-18 2017-09-12 东北大学 一种社交网络文本数据的索引方法
JP2017167882A (ja) * 2016-03-17 2017-09-21 日本電気株式会社 文境界推定装置、方法およびプログラム
CN109062902A (zh) * 2018-08-17 2018-12-21 科大讯飞股份有限公司 一种文本语义表达方法及装置
CN109614627A (zh) * 2019-01-04 2019-04-12 平安科技(深圳)有限公司 一种文本标点预测方法、装置、计算机设备及存储介质
CN110032732A (zh) * 2019-03-12 2019-07-19 平安科技(深圳)有限公司 一种文本标点预测方法、装置、计算机设备及存储介质
CN111027291A (zh) * 2019-11-27 2020-04-17 达而观信息科技(上海)有限公司 文本中标点符号添加、模型训练方法、装置及电子设备
CN111414745A (zh) * 2020-04-03 2020-07-14 龙马智芯(珠海横琴)科技有限公司 文本标点确定方法与装置、存储介质、电子设备
CN112464642A (zh) * 2020-11-25 2021-03-09 平安科技(深圳)有限公司 文本添加标点的方法、装置、介质及电子设备

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017167882A (ja) * 2016-03-17 2017-09-21 日本電気株式会社 文境界推定装置、方法およびプログラム
CN107153687A (zh) * 2017-04-18 2017-09-12 东北大学 一种社交网络文本数据的索引方法
CN109062902A (zh) * 2018-08-17 2018-12-21 科大讯飞股份有限公司 一种文本语义表达方法及装置
CN109614627A (zh) * 2019-01-04 2019-04-12 平安科技(深圳)有限公司 一种文本标点预测方法、装置、计算机设备及存储介质
CN110032732A (zh) * 2019-03-12 2019-07-19 平安科技(深圳)有限公司 一种文本标点预测方法、装置、计算机设备及存储介质
CN111027291A (zh) * 2019-11-27 2020-04-17 达而观信息科技(上海)有限公司 文本中标点符号添加、模型训练方法、装置及电子设备
CN111414745A (zh) * 2020-04-03 2020-07-14 龙马智芯(珠海横琴)科技有限公司 文本标点确定方法与装置、存储介质、电子设备
CN112464642A (zh) * 2020-11-25 2021-03-09 平安科技(深圳)有限公司 文本添加标点的方法、装置、介质及电子设备

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629237A (zh) * 2023-07-25 2023-08-22 江西财经大学 基于逐步集成多层注意力的事件表示学习方法及系统
CN116629237B (zh) * 2023-07-25 2023-10-10 江西财经大学 基于逐步集成多层注意力的事件表示学习方法及系统

Also Published As

Publication number Publication date
CN112464642A (zh) 2021-03-09

Similar Documents

Publication Publication Date Title
US11900056B2 (en) Stylistic text rewriting for a target author
US11928439B2 (en) Translation method, target information determining method, related apparatus, and storage medium
US20210406476A1 (en) Method, electronic device, and storage medium for extracting event from text
US10679148B2 (en) Implicit bridging of machine learning tasks
EP4060565A1 (en) Method and apparatus for acquiring pre-trained model
WO2021213155A1 (zh) 文本添加标点的方法、装置、介质及电子设备
WO2021179570A1 (zh) 序列标注方法、装置、计算机设备和存储介质
US20210312139A1 (en) Method and apparatus of generating semantic feature, method and apparatus of training model, electronic device, and storage medium
JP7301922B2 (ja) 意味検索方法、装置、電子機器、記憶媒体およびコンピュータプログラム
CN110276023B (zh) Poi变迁事件发现方法、装置、计算设备和介质
WO2021121198A1 (zh) 基于语义相似度的实体关系抽取方法、装置、设备及介质
CN113361285B (zh) 自然语言处理模型的训练方法、自然语言处理方法及装置
CN111931477B (zh) 文本匹配方法、装置、电子设备以及存储介质
WO2022116445A1 (zh) 文本纠错模型建立方法、装置、介质及电子设备
EP4109324A2 (en) Method and apparatus for identifying noise samples, electronic device, and storage medium
JP7337979B2 (ja) モデル訓練方法および装置、テキスト予測方法および装置、電子デバイス、コンピュータ可読記憶媒体、およびコンピュータプログラム
CN113641830B (zh) 模型预训练方法、装置、电子设备和存储介质
JP2023002690A (ja) セマンティックス認識方法、装置、電子機器及び記憶媒体
JP7291181B2 (ja) 業界テキスト増分方法、関連装置、およびコンピュータプログラム製品
CN113743101A (zh) 文本纠错方法、装置、电子设备和计算机存储介质
CN112417860A (zh) 训练样本增强方法、系统、设备及存储介质
US20230139642A1 (en) Method and apparatus for extracting skill label
WO2023061441A1 (zh) 文本的量子线路确定方法、文本分类方法及相关装置
WO2023116572A1 (zh) 一种词句生成方法及相关设备
CN114841162B (zh) 文本处理方法、装置、设备及介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21793148; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21793148; Country of ref document: EP; Kind code of ref document: A1)