WO2022141855A1 - Text regularization method and apparatus, and electronic device and storage medium

Text regularization method and apparatus, and electronic device and storage medium

Info

Publication number
WO2022141855A1
Authority
WO
WIPO (PCT)
Prior art keywords
character
text
characters
feature vector
regularized
Prior art date
Application number
PCT/CN2021/083493
Other languages
French (fr)
Chinese (zh)
Inventor
李俊杰
蒋伟伟
马骏
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a text regularization method, apparatus, electronic device and storage medium.
  • the embodiments of the present application provide a text regularization method, apparatus, electronic device and storage medium, which can perform text regularization according to the language type of the text to be regularized and the feature vector of each character, so as to improve the text regularization efficiency and reduce labor costs.
  • an embodiment of the present application provides a text regularization method, including: obtaining text to be regularized; performing character segmentation on the text to be regularized to obtain a plurality of characters; encoding each character in the plurality of characters to obtain a first feature vector of each character, wherein the first feature vector of each character is used to represent the context information of that character; and, according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized, performing regularization processing on the text to be regularized to obtain the regular text of the text to be regularized.
  • an embodiment of the present application provides a text regularization device, including: an acquisition unit, configured to acquire the text to be regularized; and a processing unit, configured to perform character segmentation on the text to be regularized to obtain a plurality of characters; encode each character in the plurality of characters to obtain the first feature vector of each character, wherein the first feature vector of each character is used to represent the context information of that character; and, according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized, perform regularization processing on the text to be regularized to obtain the regular text of the text to be regularized.
  • an embodiment of the present application provides an electronic device, including a processor connected to a memory, where the memory is used for storing a computer program and the processor is used for executing the computer program stored in the memory, so that the electronic device executes a text regularization method. The text regularization method includes: obtaining the text to be regularized; performing character segmentation on the text to be regularized to obtain a plurality of characters; encoding each character in the plurality of characters to obtain the first feature vector of each character, wherein the first feature vector of each character is used to represent the context information of that character; and, according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized, performing regularization processing on the text to be regularized to obtain the regular text of the text to be regularized.
  • an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program causes a computer to execute a text regularization method. The text regularization method includes: obtaining the text to be regularized; performing character segmentation on the text to be regularized to obtain a plurality of characters; encoding each character in the plurality of characters to obtain the first feature vector of each character, wherein the first feature vector of each character is used to represent the context information of that character; and, according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized, performing regularization processing on the text to be regularized to obtain the regular text of the text to be regularized.
  • FIG. 1 is a schematic flowchart of a text regularization method provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of encoding and decoding processing of non-standard characters according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of encoding and decoding of non-standard characters by an encoder and a decoder according to an embodiment of the present application.
  • FIG. 4 is a block diagram of functional units of a text regularization apparatus provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the technical solution of the present application relates to the field of artificial intelligence technology; it realizes text regularization and helps to promote the construction of smart cities.
  • the data involved in this application, such as the text to be regularized and/or the regular text, may be stored in a database or in a blockchain, which is not limited in this application.
  • FIG. 1 is a schematic flowchart of a text regularization method provided by an embodiment of the present application. The method is applied to a text regularization device. The method includes the following steps.
  • the text regularization device obtains the text to be regularized.
  • the text to be regularized may be manually input by the user in the information input field of the text regularization device, or it may be automatically read by the text regularization device from a text library.
  • alternatively, if there is a document to be regularized, the text regularization device can read the text to be regularized from the document sequentially. This application therefore does not limit how the text to be regularized is obtained.
  • the text regularization device performs character segmentation on the text to be regularized to obtain multiple characters.
  • the text to be regularized may be segmented into characters by a tokenizer; for example, it may be segmented by a word2vec tokenizer.
  • the characters may be English words, Chinese words, French words or special symbols, such as "$", "/", and so on.
  • the text regularization device encodes each of the multiple characters to obtain a first feature vector of each character, wherein the first feature vector of each character is used to represent the context information of that character.
  • each character in the plurality of characters is encoded to obtain a character vector corresponding to each character in the plurality of characters.
  • each character is segmented to obtain the letter string of that character; each letter in the letter string is encoded to obtain the letter vector corresponding to that letter; finally, the letter vectors of each character are encoded to obtain the character vector of that character.
  • for example, the character "Achieve" is split into the letter string "A", "c", "h", "i", "e", "v", "e", and the letter vector of each letter in the letter string is used as input to the encoder, resulting in the character vector for the character "Achieve".
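The letter-level encoding described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not specify the letter embedding or the encoder, so `letter_vector` (a hash-based stand-in for a learned embedding) and the mean pooling in `character_vector` are hypothetical placeholders.

```python
import hashlib


def letter_vector(letter: str, dim: int = 4) -> list[float]:
    """Deterministic stand-in for a learned letter embedding (assumption)."""
    digest = hashlib.sha256(letter.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def character_vector(character: str, dim: int = 4) -> list[float]:
    """Split a character (word) into letters and pool their letter vectors."""
    vectors = [letter_vector(letter, dim) for letter in character]
    # Mean pooling stands in for the unspecified encoder (assumption).
    return [sum(column) / len(vectors) for column in zip(*vectors)]


letters = list("Achieve")          # the letter string of the character
vec = character_vector("Achieve")  # one vector for the whole character
```

A trained model would replace both placeholders; the sketch only shows the character → letters → letter vectors → character vector flow.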
  • a first text corresponding to the character A is constructed with the character A as the center, wherein the character A is any one of the multiple characters, and the first text includes the X characters located before the character A in the text to be regularized, the character A, and the Y characters located after the character A in the text to be regularized, where X and Y are both integers greater than or equal to 1; the character vectors corresponding to the characters in the first text are then spliced (i.e., horizontally concatenated) to obtain the first feature vector corresponding to the character A, which is used to represent the context information of the character A in the first text.
  • if the character A is the first character or the last character in the text to be regularized, preset characters can be filled in (for example, the start character S) to construct the first text for the character A.
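The window construction and splicing described above can be sketched as follows. The function names, the default window sizes, and the use of a single pad vector for the start character S are assumptions made for illustration.

```python
def build_first_text(chars, index, x=1, y=1, pad="S"):
    """Return the X characters before chars[index], the character, and the Y after it.

    Positions that fall outside the text are filled with the preset character.
    """
    return [
        chars[pos] if 0 <= pos < len(chars) else pad
        for pos in range(index - x, index + y + 1)
    ]


def first_feature_vector(char_vectors, index, pad_vector, x=1, y=1):
    """Horizontally concatenate the character vectors inside the window."""
    spliced = []
    for pos in range(index - x, index + y + 1):
        inside = 0 <= pos < len(char_vectors)
        spliced.extend(char_vectors[pos] if inside else pad_vector)
    return spliced


chars = ["about", "$", "1", "billion"]
char_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
window = build_first_text(chars, 0)
feature = first_feature_vector(char_vectors, 0, pad_vector=[0.0, 0.0])
```

With X = Y = 1 each first feature vector has three times the dimension of a single character vector, so every character receives a vector of the same length regardless of its position.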
  • the text regularization device performs regularization processing on the text to be regularized according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized, to obtain the regular text of the text to be regularized.
  • according to the first feature vector of each character, an attribute of each character in the plurality of characters is determined, wherein the attribute of each character is either standard character or non-standard character; then, each standard character in the text to be regularized is used as its own regular character, and, according to the language type of the text to be regularized and the first feature vector corresponding to each non-standard character in the text to be regularized, the non-standard character is encoded and decoded to obtain the regular character of the non-standard character; finally, the regular characters of the standard characters and the regular characters of the non-standard characters are combined to obtain the regular text of the text to be regularized.
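The overall flow (classify, keep standard characters, convert non-standard characters, recombine) can be sketched as follows. The classifier and the converter are passed in as hypothetical stubs: `NON_STANDARD` and `TABLE` are toy stand-ins for the neural classifier and the encoder-decoder, and grouping consecutive non-standard characters into one run is an assumption of this sketch.

```python
from itertools import groupby


def regularize(chars, is_standard, normalize_run):
    """Keep standard characters as-is and convert each run of non-standard ones."""
    regular = []
    for standard, run in groupby(chars, key=is_standard):
        run = list(run)
        regular.extend(run if standard else normalize_run(run))
    return regular


# Toy stand-ins for the learned components (assumptions, not the patented models).
NON_STANDARD = {"$", "1", "billion"}
TABLE = {("$", "1", "billion"): ["one", "billion", "dollars"]}

chars = "Achieve record net income of about $ 1 billion during the year".split()
result = regularize(
    chars,
    is_standard=lambda c: c not in NON_STANDARD,
    normalize_run=lambda run: TABLE[tuple(run)],
)
regular_text = " ".join(result)
```

Passing the two components as callables keeps the recombination logic independent of how the attribute and the regular characters are actually predicted.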
  • a standard character refers to a character whose pronunciation and writing are the same.
  • for example, for the character "year", the pronunciation and the writing are the same, that is, both are "year", so the regular character of this standard character is the character itself.
  • the non-standard characters involved in this application include but are not limited to the following:
  • a second text corresponding to the character B is constructed with the character B as the center; the second text includes the M characters located before the character B in the text to be regularized, the character B, and the N characters located after the character B in the text to be regularized, where the character B is any non-standard character in the text to be regularized, and M and N are integers greater than or equal to 1; then, each character in the second text is encoded through byte-pair encoding (BPE) to obtain the second feature vector of each character in the second text.
  • each character in the second text is split into a letter string, and, according to the frequency of occurrence of the letter strings of all characters in the second text, the letters in the letter string of each character are merged to obtain a new letter string; then, the new letter string of each character is input into the encoder for encoding, and the second feature vector of each character in the second text is obtained.
  • byte-pair encoding (BPE) thus addresses the problem of unregistered (out-of-vocabulary) words in the second text.
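The frequency-based merging step can be illustrated with a minimal byte-pair encoding sketch. It follows the standard BPE procedure (repeatedly merge the most frequent adjacent symbol pair); the function names, the tie-breaking rule, and the toy word list are assumptions, and the source does not state how many merges are performed.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges; return (merges, segmented words)."""
    corpus = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Most frequent pair; ties broken lexicographically to stay deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges, corpus


merges, segmented = learn_bpe(["low", "low", "lower"], num_merges=2)
```

Because rare words decompose into frequent sub-strings, a word never seen as a whole can still be represented by known pieces, which is how BPE handles unregistered words.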
  • the second feature vector of each character in the second text is input into the Transformer-XL network for feature extraction to obtain a third feature vector corresponding to the character B, where the third feature vector of the character B is used to represent the context information of the character B in the second text; finally, according to the first feature vector of the character B, the third feature vector of the character B and the language type of the text to be regularized, the character B is encoded and decoded to obtain the regular character corresponding to the character B.
  • the process of encoding and decoding the character B will be described in detail later, and will not be described too much here.
  • it can be seen that, in this embodiment, character segmentation is first performed on the text to be regularized, and each character is then encoded to obtain its first feature vector; finally, the text to be regularized is regularized according to the first feature vector of each character and the language type of the text to be regularized. Regularization can therefore be completed without manually writing regular rules, which improves the efficiency of text regularization and saves labor costs.
  • in addition, the language type of the text to be regularized is taken into account during regularization, so texts in various languages can be regularized, which gives the text regularization method of the present application more usage scenarios.
  • FIG. 2 is a schematic flowchart of an encoding and decoding method provided by an embodiment of the present application. The method is applied to a text regularization device. The method includes the following steps.
  • performing word embedding processing on character B is actually performing mapping processing on character B to obtain the fourth feature vector of character B.
  • the ASCII code of character B can be used as the fourth feature vector of character B.
  • encoding the attributes of the character B means mapping the part of speech to which the character B belongs to obtain the part-of-speech vector of the character B. For example, if the part of speech of the character B is "currency", the GB2312 code of "currency" is used as the part-of-speech vector of the character B.
  • the earlier process of classifying the attributes of a character only determines whether the character is a standard character or a non-standard character; it does not subdivide the non-standard characters. Therefore, the first feature vector of each character can only be used to distinguish standard characters from non-standard characters, and cannot distinguish further among non-standard characters.
  • a more detailed classification is therefore performed on the character B: the part of speech of each non-standard character is mapped to a part-of-speech vector, which gives a more detailed category for each non-standard character.
  • the language type of the text to be regularized is mapped to obtain the language vector of the text to be regularized.
  • for example, the GB2312 code of the Chinese representation of the language type can be used as the language vector, where the language type is, for example, "English", "Chinese" or "French".
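The three mappings described above (ASCII code for the fourth feature vector, GB2312 code for the part-of-speech vector and for the language vector) can be sketched with Python's built-in codecs. The helper names are hypothetical, and the Chinese labels "货币" (currency) and "英语" (English) are assumed renderings; the source does not give the exact strings it encodes.

```python
def ascii_vector(character: str) -> list[int]:
    """ASCII code points of the character, used here as its fourth feature vector."""
    return list(character.encode("ascii"))


def gb2312_vector(label: str) -> list[int]:
    """GB2312 byte values of a Chinese label (part of speech or language type)."""
    return list(label.encode("gb2312"))


fourth_vector = ascii_vector("$")             # character B is "$"
part_of_speech_vector = gb2312_vector("货币")  # assumed label for "currency"
language_vector = gb2312_vector("英语")        # assumed label for "English"
```

GB2312 uses two bytes per Chinese character, so a two-character label yields a four-element vector.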
  • the encoder may be a neural network constructed based on a long short term memory network, a bidirectional long short term memory network or a recurrent network. This application does not limit the type of encoder.
  • the character B is encoded according to the hidden layer vector output by the encoder in its previous encoding step, the fourth feature vector of the character B, and the encoding parameter of the encoder (that is, the language vector), to obtain the fifth feature vector corresponding to the character B and the hidden layer vector corresponding to the character B.
  • if the character B is the first character to be encoded, the hidden layer vector output by the previous encoding step is a preset hidden layer vector, for example, a zero vector.
  • the hidden layer vector finally output by the encoder is the hidden layer vector generated while encoding the character B. If other non-standard characters need to be encoded, the hidden layer vector corresponding to the character B is used as the previous hidden layer vector when encoding the next non-standard character.
  • if the character B is one of multiple consecutive non-standard characters in the text to be regularized that have the same attributes (that is, exactly the same part of speech; for example, they are all dates among the non-standard words), then, to improve the encoding efficiency and encoding precision for these characters, the multiple non-standard characters can be encoded together instead of encoding a single non-standard character in isolation.
  • in this case, the hidden layer vector output by the last encoding contains the contextual semantic information of these multiple non-standard characters.
  • the multiple non-standard characters [X_1, X_2, ..., X_n] are encoded successively, and the fifth feature vectors [Y_1, Y_2, ..., Y_n] corresponding to the multiple non-standard characters are output.
  • for example, the three consecutive non-standard characters in "$1 billion" can be encoded successively, outputting the fifth feature vectors of the three non-standard characters and the hidden layer vector obtained by the encoder in the last encoding.
  • this hidden layer vector contains the full semantic information of these three non-standard characters.
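The successive encoding can be sketched with a plain recurrent cell. This is an illustration only: the source allows LSTM, bidirectional LSTM or recurrent encoders and does not say how the language vector enters the computation, so appending it to every input, the random untrained weights, and treating each hidden state as the fifth feature vector are all assumptions.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix standing in for trained parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def encode_sequence(inputs, language_vector, hidden_size=3, seed=0):
    """Encode X_1..X_n in order; return ([Y_1..Y_n], last hidden layer vector)."""
    rng = random.Random(seed)
    w_in = make_matrix(hidden_size, len(inputs[0]) + len(language_vector), rng)
    w_hidden = make_matrix(hidden_size, hidden_size, rng)
    hidden = [0.0] * hidden_size  # preset zero vector before the first character
    outputs = []
    for x in inputs:
        from_input = matvec(w_in, x + language_vector)
        from_hidden = matvec(w_hidden, hidden)
        hidden = [math.tanh(a + b) for a, b in zip(from_input, from_hidden)]
        outputs.append(hidden)  # in this simple cell, Y_t equals the hidden state
    return outputs, hidden


fifth_vectors, last_hidden = encode_sequence(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], language_vector=[0.5]
)
```

Each step consumes the hidden vector produced by the previous one, so the final hidden vector depends on every character in the run.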
  • the decoder may be a neural network constructed based on a long short term memory network, a bidirectional long short term memory network or a recurrent network. This application does not limit the type of decoder.
  • an attention mechanism operation is performed on the hidden layer vector output by the decoder in its previous decoding step and the fifth feature vector corresponding to the character B, to obtain the sixth feature vector corresponding to the character B.
  • the attention mechanism can be a general attention mechanism operation; for example, the fifth feature vector corresponding to the character B is used as the key-value pair, that is, as both the key vector and the value vector, and the hidden layer vector output by the decoder in its previous decoding step is used as the query vector to perform the attention mechanism operation, yielding the sixth feature vector corresponding to the character B.
  • the attention mechanism operations mentioned later are similar and are not described again.
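The query/key/value roles described above can be sketched as follows. Dot-product scoring with a softmax is an assumption; the source only says that a general attention mechanism is used, with the fifth feature vectors as keys and values and the previous hidden layer vector as the query.

```python
import math


def attention(query, keys, values):
    """Return (context vector, attention weights) for one query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


fifth_vectors = [[1.0, 0.0], [0.0, 1.0]]  # used as both keys and values
previous_hidden = [10.0, 0.0]             # query
sixth_vector, weights = attention(previous_hidden, fifth_vectors, fifth_vectors)
```

Here the query aligns with the first key, so almost all of the weight, and therefore the sixth feature vector, concentrates on the first character.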
  • if the character B is the first character to be decoded, the hidden layer vector output by the decoder in its previous decoding step is the hidden layer vector output by the encoder in its last encoding step; if the character B is not the first character to be decoded, it is the hidden layer vector generated when the decoder decoded the previous character. Since this hidden layer vector (for example, the one output by the last encoding step of the encoder) contains the contextual semantic information of the character B, the attention mechanism operation retains the key information for this decoding and thereby improves the decoding accuracy.
  • the part-of-speech vector of the character B, the third feature vector of the character B, the sixth feature vector of the character B and the decoding result of the decoder's previous decoding step are spliced to obtain the target feature vector of the character B; the character B is then decoded according to the decoding parameter of the decoder (the language vector) and the target feature vector of the character B to obtain the regular character corresponding to the character B. That is, the decoding parameters of the decoder are applied to the target feature vector to obtain the probability of each character in the standard dictionary, and the standard character with the maximum probability is used as the regular character of the character B.
  • the decoding result of the previous decoding step is the result generated when the decoder decoded the previous character (that is, the feature vector of the regular character of the previous character).
  • if the character B is the first character to be decoded, the decoding result of the previous decoding step is the feature vector of a preset character; for example, the preset character is the start symbol S, and the feature vector of the start symbol S is spliced in to indicate the start of this decoding.
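The splicing and the maximum-probability selection can be sketched as follows. A single linear layer followed by a softmax stands in for the decoder, and the weight matrix and three-entry dictionary are toy values; all of these are assumptions for illustration.

```python
import math


def splice(*vectors):
    """Concatenate the given vectors into one target feature vector."""
    target = []
    for vector in vectors:
        target.extend(vector)
    return target


def pick_regular_character(target, weights, dictionary):
    """Score every dictionary entry and return the most probable one."""
    logits = [sum(w * t for w, t in zip(row, target)) for row in weights]
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    probabilities = [e / total for e in exps]
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return dictionary[best], probabilities


# part-of-speech vector, third vector, sixth vector, previous decoding result
target = splice([1.0], [0.0], [2.0], [0.0])
toy_weights = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]]  # one row per dictionary entry
regular_character, probabilities = pick_regular_character(
    target, toy_weights, ["one", "billion", "dollars"]
)
```

On the first decoding step the last argument to `splice` would be the feature vector of the start symbol S rather than a previous decoding result.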
  • if the character B is one of multiple consecutive non-standard characters in the text to be regularized that have the same attributes (that is, exactly the same part of speech; for example, they are all dates among the non-standard words), then, to speed up the decoding of non-standard characters, the multiple non-standard characters are decoded in sequence according to their fifth feature vectors, rather than decoding a single non-standard character in isolation.
  • first, the attention mechanism operation is performed on the hidden layer vector e_0 output by the last encoding of the encoder and the fifth feature vectors [Y_1, Y_2, ..., Y_n] of the multiple non-standard characters [X_1, X_2, ..., X_n] to obtain a sixth feature vector.
  • since the hidden layer vector output by the encoder for the last encoding contains the full semantic information of the multiple non-standard characters [X_1, X_2, ..., X_n], the attention mechanism operation places the decoding attention on the first character that needs to be decoded, thereby improving the decoding accuracy.
  • the part-of-speech vector L of the multiple non-standard characters can be the part-of-speech vector of any one of them, and the third feature vector H of the multiple non-standard characters is the average of the third feature vectors of the individual non-standard characters.
  • the first non-standard character to be decoded is decoded to obtain the first decoding result Z_1 (that is, the regular character of the first decoded non-standard character) and the hidden layer vector d_1 of the first decoding; then, the second decoding is performed using the hidden layer vector d_1 of the first decoding, the first decoding result Z_1, the part-of-speech vector L and the third feature vector H of the multiple non-standard characters, and the fifth feature vectors [Y_1, Y_2, ..., Y_n] of the multiple non-standard characters, to obtain the second decoding result Z_2 (that is, the regular character of the second non-standard character that needs to be decoded) and the hidden layer vector of the second decoding; the above steps are repeated until the regular characters [Z_1, Z_2, ...] of the multiple non-standard characters [X_1, X_2, ..., X_n] are obtained.
  • the following takes the non-standard characters "$1 billion" as an example to illustrate the decoding process.
  • in the first decoding, the attention mechanism operation is performed on the hidden layer vector output by the encoder for the last encoding and the fifth feature vectors of the above three non-standard characters to obtain a sixth feature vector (because the first decoding targets the character "1", the sixth feature vector focuses on the character "1"); then, the sixth feature vector, the part-of-speech vector (the part-of-speech vectors of the three non-standard characters are the same), the third feature vector (obtained by averaging the third feature vectors of the three non-standard characters) and the feature vector of the start symbol S are spliced to obtain a target feature vector; the first decoding is performed according to the target feature vector to obtain the vector of the character "1" (after mapping this vector, the regular character of "1" is obtained as "one") and the hidden layer vector of the first decoding.
  • in the second decoding, the attention mechanism operation is performed on the hidden layer vector of the first decoding and the fifth feature vectors of the three characters to obtain a sixth feature vector; the sixth feature vector, the part-of-speech vector, the third feature vector and the vector of the character "1" output by the first decoding are spliced to obtain a target feature vector, which is input into the decoder for decoding to obtain the vector of the character "billion" (after mapping this vector, the regular character of "billion" is obtained as "billion") and the hidden layer vector of the second decoding.
  • in the third decoding, the attention mechanism operation is performed on the hidden layer vector of the second decoding and the fifth feature vectors of the three characters to obtain a sixth feature vector; the sixth feature vector, the part-of-speech vector, the third feature vector and the vector of the character "billion" output by the second decoding are spliced to obtain a target feature vector, which is input into the decoder for decoding to obtain the vector of the character "$" (after mapping this vector, the regular character of "$" is obtained).
  • in this way, the three consecutive non-standard characters "$1 billion" can be regularized into "one billion dollars" in one pass, and the above text to be regularized is regularized as "Achieve record net income of about one billion dollars during the year".
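The step-by-step decoding of a run of non-standard characters can be sketched as a loop that feeds each step's hidden layer vector and decoding result into the next step. Only the control flow follows the description; `toy_attend` and `make_toy_step` are hypothetical stubs standing in for the attention operation and the trained decoder, and the fixed `max_steps` stopping rule is an assumption because the source does not state when decoding ends.

```python
def mean_vector(vectors):
    """Element-wise average, used for the shared third feature vector H."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def decode_run(fifth, third, pos_vector, e0, start_vector, attend, step, max_steps):
    """Decode a run of non-standard characters one regular character at a time."""
    h_avg = mean_vector(third)
    hidden, previous = e0, start_vector  # first step: encoder state and start symbol S
    results = []
    for _ in range(max_steps):
        sixth = attend(hidden, fifth)
        target = sixth + pos_vector + h_avg + previous  # spliced target feature vector
        regular, previous, hidden = step(target, hidden)
        results.append(regular)
    return results


def toy_attend(hidden, keys):
    """Stub attention: ignores the query and averages the keys."""
    return mean_vector(keys)


def make_toy_step(outputs):
    """Stub decoder step that replays a fixed list of regular characters."""
    pending = iter(outputs)

    def step(target, hidden):
        word = next(pending)
        return word, [float(len(word))], [h + 1.0 for h in hidden]

    return step


regular = decode_run(
    fifth=[[0.1], [0.2], [0.3]],
    third=[[1.0], [2.0], [3.0]],
    pos_vector=[7.0],
    e0=[0.0],
    start_vector=[0.0],
    attend=toy_attend,
    step=make_toy_step(["one", "billion", "dollars"]),
    max_steps=3,
)
```

Replacing the two stubs with a real attention function and a trained decoder step leaves the loop itself unchanged.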
  • FIG. 4 is a block diagram of functional units of a text regularization device provided by an embodiment of the present application.
  • the text regularization device 400 includes an obtaining unit 401 and a processing unit 402, wherein: the obtaining unit 401 is used to obtain the text to be regularized; the processing unit 402 is used to perform character segmentation on the text to be regularized to obtain a plurality of characters; encode each character in the plurality of characters to obtain a first feature vector of each character, wherein the first feature vector of each character is used to represent the context information of that character; and, according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized, perform regularization processing on the text to be regularized to obtain the regular text of the text to be regularized.
  • in terms of encoding each character in the plurality of characters to obtain the first feature vector of each character, the processing unit 402 is specifically configured to: encode each character in the plurality of characters to obtain a character vector corresponding to each character; construct, with the character A as the center, a first text corresponding to the character A, where the first text includes the X characters located before the character A in the text to be regularized, the character A, and the Y characters located after the character A in the text to be regularized, the character A is any one of the plurality of characters, and X and Y are both integers greater than or equal to 1; and splice the character vectors corresponding to the characters in the first text to obtain the first feature vector of the character A, which is used to represent the context information of the character A in the first text.
  • in terms of performing regularization processing on the text to be regularized to obtain the regular text of the text to be regularized, the processing unit 402 is specifically configured to: determine the attribute of each character in the plurality of characters, where the attribute of a character is either standard character or non-standard character; use each standard character in the text to be regularized as its own regular character; encode and decode each non-standard character according to the language type and the first feature vector corresponding to the non-standard character in the text to be regularized, to obtain the regular character of the non-standard character; and combine the regular characters of the standard characters and the regular characters of the non-standard characters in the text to be regularized to obtain the regular text of the text to be regularized.
  • in terms of encoding and decoding the non-standard characters to obtain the regular characters of the non-standard characters, the processing unit 402 is specifically configured to: construct, with the character B as the center, a second text corresponding to the character B, where the second text includes the M characters located before the character B in the text to be regularized, the character B, and the N characters located after the character B in the text to be regularized, the character B is any non-standard character in the text to be regularized, and M and N are both integers greater than or equal to 1; encode each character in the second text through byte-pair encoding (BPE) to obtain a second feature vector of each character in the second text; input the second feature vector of each character in the second text into the Transformer-XL network to obtain the third feature vector corresponding to the character B, which is used to represent the context information of the character B in the second text; and perform encoding and decoding processing on the character B according to the attribute of the character B, the third feature vector of the character B, and the language type, to obtain the regular character corresponding to the character B.
  • in terms of performing encoding and decoding processing on the character B according to the attribute of the character B, the third feature vector of the character B, and the language type, so as to obtain the regular character corresponding to the character B, the processing unit 402 is specifically configured to: perform word embedding processing on the character B to obtain the fourth feature vector of the character B; encode the attributes of the character B to obtain the part-of-speech vector corresponding to the character B; encode the language type to obtain a language vector, and use the language vector as the encoding parameter of the encoder and as the decoding parameter of the decoder; input the fourth feature vector of the character B into the encoder to encode the character B and obtain the fifth feature vector of the character B; and input the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder to decode the character B and obtain the regular character corresponding to the character B.
  • in terms of inputting the fourth feature vector of the character B into the encoder for encoding and obtaining the fifth feature vector of the character B, the processing unit 402 is specifically configured to: encode the character B according to the hidden layer vector output by the encoder in its previous encoding step, the fourth feature vector of the character B, and the encoding parameters of the encoder, to obtain the fifth feature vector of the character B.
  • the part-of-speech vector of the character B and the fifth feature vector of the character B are input into the decoder, and the character B is decoded to obtain the regular character corresponding to the character B
  • the processing unit 402 is specifically configured to: perform an attention mechanism operation on the hidden layer vector decoded and output by the decoder last time and the fifth feature vector corresponding to the character B to obtain the sixth feature vector corresponding to the character B.
  • the character B is decoded according to the encoding parameters of the encoder and the target feature vector of the character B to obtain the regular character corresponding to the character B.
  • FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • the electronic device includes a processor and memory.
  • the electronic device may further include a transceiver.
  • the electronic device 500 includes a transceiver 501 , a processor 502 and a memory 503 . They are connected by bus 504 .
  • the memory 503 is used to store computer programs and data, and can transmit the data stored in the memory 503 to the processor 502 .
  • the processor 502 is configured to read the computer program in the memory 503 and perform the following operations: control the transceiver 501 to obtain the text to be regularized; perform character segmentation on the text to be regularized to obtain a plurality of characters; encode each of the plurality of characters to obtain a first feature vector of each of the plurality of characters, wherein the first feature vector of each character is used to represent the context information of that character; and, according to the first feature vector of each of the plurality of characters and the language type of the text to be regularized, perform regularization processing on the text to be regularized to obtain the regular text of the text to be regularized.
  • the processor 502 is specifically configured to perform the following operations: encode each character in the plurality of characters to obtain a character vector corresponding to each character; taking character A as the center, construct a first text corresponding to the character A, where the first text includes the X characters before the character A in the text to be regularized, the character A, and the Y characters after the character A in the text to be regularized, the character A is any one of the plurality of characters, and X and Y are both integers greater than or equal to 1; splice the character vectors corresponding to the characters in the first text to obtain the first feature vector of the character A, where the first feature vector of the character A is used to represent the context information of the character A in the first text.
  • the text to be regularized is subjected to regularization processing to obtain the regular text of the text to be regularized
  • the processor 502 is specifically configured to perform the following operations: determine the attribute of each character in the plurality of characters according to the first feature vector of each character, where the attribute of each character is standard character or non-standard character; use each standard character in the text to be regularized as the regular character of that standard character; according to the language type and the first feature vector corresponding to each non-standard character in the text to be regularized, encode and decode the non-standard character to obtain the regular character of the non-standard character; combine the regular characters of the standard characters and the regular characters of the non-standard characters in the text to be regularized to obtain the regular text of the text to be regularized.
  • the non-standard characters are encoded and decoded to obtain the regular characters of the non-standard characters.
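  • the processor 502 is specifically configured to perform the following operations: taking the character B as the center, construct a second text corresponding to the character B, where the second text includes the M characters located before the character B in the text to be regularized, the character B, and the N characters located after the character B in the text to be regularized, the character B is any non-standard character in the text to be regularized, and M and N are both integers greater than or equal to 1; encode each character in the second text through byte-pair encoding to obtain a second feature vector of each character in the second text; input the second feature vector of each character in the second text into the Transformer-XL network to obtain the third feature vector corresponding to the character B, where the third feature vector of the character B is used to represent the context information of the character B in the second text; according to the attribute of the character B, the third feature vector of the character B and the language type, encode and decode the character B to obtain the regular character corresponding to the character B.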
  • encoding and decoding processing is performed on the character B according to the attribute of the character B, the third feature vector of the character B, and the language type, so as to obtain the regular character corresponding to the character B.
  • the processor 502 is specifically configured to perform the following operations: perform word embedding processing on the character B to obtain a fourth feature vector of the character B; encode the attribute of the character B to obtain the part-of-speech vector corresponding to the character B; encode the language type to obtain a language vector, and use the language vector as the encoding parameter of the encoder and the decoding parameter of the decoder respectively; input the fourth feature vector of the character B into the encoder and encode the character B to obtain the fifth feature vector of the character B; input the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder and decode the character B to obtain the regular character corresponding to the character B.
  • the processor 502 in terms of inputting the fourth feature vector of the character B into the encoder for encoding, and obtaining the fifth feature vector of the character B, is specifically configured to perform the following operations: According to the hidden layer vector output by the encoder last encoding, the fourth feature vector of the character B and the encoding parameters of the encoder, the character B is encoded to obtain the fifth feature vector of the character B .
  • the part-of-speech vector of the character B and the fifth feature vector of the character B are input into the decoder, and the character B is decoded to obtain the regular character corresponding to the character B
  • the processor 502 is specifically configured to perform the following operations: perform an attention mechanism operation on the hidden layer vector output by the decoder in its previous decoding step and the fifth feature vector corresponding to the character B to obtain the sixth feature vector corresponding to the character B; splice the part-of-speech vector of the character B, the third feature vector of the character B, the sixth feature vector of the character B, and the decoding result output by the decoder in its previous decoding step to obtain the target feature vector of the character B; according to the encoding parameters of the encoder and the target feature vector of the character B, decode the character B to obtain the regular character corresponding to the character B.
  • the transceiver 501 may be the acquisition unit 401 of the text regularization apparatus 400 of the embodiment shown in FIG. 4
  • the processor 502 may be the processing unit 402 of the text regularization apparatus 400 of the embodiment shown in FIG. 4 .
  • the text regularization device in this application may include smart phones (such as Android phones, iOS phones, Windows Phone phones, etc.), tablet computers, PDAs, notebook computers, mobile Internet devices (MID), wearable devices, etc.
  • the text regularization devices listed above are only examples and are not exhaustive; the text regularization device includes but is not limited to the devices listed above.
  • the above-mentioned text regularization apparatus may further include: an intelligent vehicle-mounted terminal, a computer device, and the like.
  • Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement some or all of the steps of any one of the text regularization methods described in the foregoing method embodiments.
  • the storage medium involved in this application such as a computer-readable storage medium, may be non-volatile or volatile.
  • Embodiments of the present application further provide a computer program product, the computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute some or all of the steps of any one of the text regularization methods described in the foregoing method embodiments.
  • the disclosed apparatus may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative; for example, the division of the units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, and can also be implemented in the form of software program modules.
  • the integrated unit if implemented in the form of a software program module and sold or used as a stand-alone product, may be stored in a computer readable memory.
  • the technical solution of the present application in essence, or the part that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product, and the computer software product is stored in a memory.
  • a computer device which may be a personal computer, a server, or a network device, etc.
  • the aforementioned memory includes: a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to the technical field of artificial intelligence, and in particular to a text regularization method and apparatus, and an electronic device and a storage medium. The method comprises: acquiring text to be regularized; performing character segmentation on the text to be regularized, so as to obtain a plurality of characters; encoding each of the plurality of characters, so as to obtain a first feature vector of each of the plurality of characters, wherein the first feature vector of each of the plurality of characters is used for representing context information of each of the plurality of characters; and performing, according to the first feature vector of each of the plurality of characters and a language type of the text to be regularized, regularization processing on the text to be regularized, so as to obtain regularized text of the text to be regularized. The present application is beneficial for improving the efficiency and accuracy of regularizing text.

Description

Text regularization method, apparatus, electronic device and storage medium
This application claims priority to Chinese patent application No. 202011644545.8, filed with the China Patent Office on December 31, 2020 and entitled "Text Regularization Method, Apparatus, Electronic Device and Storage Medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to a text regularization method, apparatus, electronic device and storage medium.
Background
Building a traditional text regularization system requires a strong linguistic background, and typically requires domain experts to hand-craft a large number of complex and tedious text regularization rules tailored to the characteristics of the language. Moreover, linguistic knowledge differs markedly between languages and cannot be transferred effectively; regularizing text in a new language requires building a new set of text regularization rules.
The inventors found that in recent years, with the rapid development of artificial intelligence, text regularization systems based on encoder-decoder neural networks have begun to appear. However, the inventors realized that, because of the soft-classification nature of a pure encoder-decoder model, such a model cannot achieve satisfactory text regularization accuracy. Mainstream text regularization systems therefore still rely on a manually constructed set of specific, complex and tedious rules, and different rules must be constructed for different languages, which requires a large investment of manpower and material resources; code may also be redundant across the different rule sets.
Therefore, existing text regularization requires a manually constructed text regularization system, which incurs high labor costs and makes text regularization inefficient.
Technical Problem
The embodiments of the present application provide a text regularization method, apparatus, electronic device and storage medium that perform text regularization according to the language type of the text to be regularized and the feature vector of each character, so as to improve text regularization efficiency and reduce labor costs.
Technical Solution
In a first aspect, an embodiment of the present application provides a text regularization method, including: obtaining text to be regularized; performing character segmentation on the text to be regularized to obtain a plurality of characters; encoding each of the plurality of characters to obtain a first feature vector of each of the plurality of characters, where the first feature vector of each character represents the context information of that character; and performing regularization processing on the text to be regularized according to the first feature vector of each of the plurality of characters and the language type of the text to be regularized, to obtain the regular text of the text to be regularized.
In a second aspect, an embodiment of the present application provides a text regularization apparatus, including: an acquisition unit, configured to obtain text to be regularized; and a processing unit, configured to perform character segmentation on the text to be regularized to obtain a plurality of characters; encode each of the plurality of characters to obtain a first feature vector of each of the plurality of characters, where the first feature vector of each character represents the context information of that character; and perform regularization processing on the text to be regularized according to the first feature vector of each of the plurality of characters and the language type of the text to be regularized, to obtain the regular text of the text to be regularized.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor connected to a memory, where the memory is used to store a computer program and the processor is used to execute the computer program stored in the memory, so that the electronic device performs a text regularization method including: obtaining text to be regularized; performing character segmentation on the text to be regularized to obtain a plurality of characters; encoding each of the plurality of characters to obtain a first feature vector of each of the plurality of characters, where the first feature vector of each character represents the context information of that character; and performing regularization processing on the text to be regularized according to the first feature vector of each of the plurality of characters and the language type of the text to be regularized, to obtain the regular text of the text to be regularized.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program causes a computer to perform a text regularization method including: obtaining text to be regularized; performing character segmentation on the text to be regularized to obtain a plurality of characters; encoding each of the plurality of characters to obtain a first feature vector of each of the plurality of characters, where the first feature vector of each character represents the context information of that character; and performing regularization processing on the text to be regularized according to the first feature vector of each of the plurality of characters and the language type of the text to be regularized, to obtain the regular text of the text to be regularized.
Beneficial Effects
By implementing the embodiments of the present application, the text to be regularized can be regularized without manually written regularization rules, which improves the efficiency of text regularization and saves labor costs. In addition, the language type of the text to be regularized is taken into account during regularization, so that text in various languages can be regularized and the text regularization method of the present application applies to more usage scenarios.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a text regularization method provided by an embodiment of the present application.
FIG. 2 is a schematic flowchart of encoding and decoding processing of non-standard characters provided by an embodiment of the present application.
FIG. 3 is a schematic diagram of an encoder and a decoder encoding and decoding non-standard characters, provided by an embodiment of the present application.
FIG. 4 is a block diagram of the functional units of a text regularization apparatus provided by an embodiment of the present application.
FIG. 5 is a schematic structural diagram of a text regularization apparatus provided by an embodiment of the present application.
Embodiments of the Invention
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings in the embodiments of the present application.
The technical solution of the present application relates to the field of artificial intelligence and implements text regularization, which helps promote the construction of smart cities. Optionally, data involved in this application, such as the text to be regularized and/or the regular text, may be stored in a database or in a blockchain, which is not limited in this application.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of a text regularization method provided by an embodiment of the present application. The method is applied to a text regularization apparatus and includes the following steps.
101: The text regularization apparatus obtains the text to be regularized.
For example, the text to be regularized may be entered manually by a user in an information input field of the text regularization apparatus, or read automatically by the text regularization apparatus from a text library. For instance, if the text to be regularized belongs to a document to be regularized, the text regularization apparatus may read the text to be regularized from that document in sequence. The present application therefore does not limit how the text to be regularized is obtained.
102: The text regularization apparatus performs character segmentation on the text to be regularized to obtain a plurality of characters.
For example, the text to be regularized may be segmented by a tokenizer to obtain the plurality of characters; for instance, the segmentation may be performed with a word2vec tokenizer. A character may be an English word, a Chinese word, a French word, or a special symbol such as "$" or "/".
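As a rough illustration of step 102, the sketch below splits a space-delimited string into word tokens and stand-alone symbols. It is only a minimal stand-in: the regular expression and the names `TOKEN_PATTERN` and `segment` are assumptions made for this example and are not the tokenizer named in the application.

```python
import re

# Hypothetical pattern (an assumption, not from the application):
# a run of letters/digits forms one token, and any other
# non-whitespace symbol such as "$" or "/" is a token on its own.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def segment(text: str) -> list[str]:
    """Split the text to be regularized into tokens ("characters" in the application's wording)."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(segment("It costs $5 on 1/2"))
```

A pattern like this only suits languages that separate words with spaces; segmenting Chinese text would need a proper word segmenter.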
103: The text regularization apparatus encodes each of the plurality of characters to obtain a first feature vector of each of the plurality of characters, where the first feature vector of each character represents the context information of that character.
For example, each of the plurality of characters is encoded to obtain a character vector corresponding to that character. Specifically, each character is split into its letter string; each letter in the letter string is encoded to obtain a letter vector; finally, the letter vectors of the character are encoded to obtain the character vector of the character. For instance, the character "Achieve" is split into the letter string "A", "c", "h", "i", "e", "v", "e", and the letter vectors of these letters are fed to an encoder as input and modeled to obtain the character vector of "Achieve". Then, a first text corresponding to character A is constructed with character A as the center, where character A is any one of the plurality of characters, and the first text includes the X characters located before character A in the text to be regularized, character A itself, and the Y characters located after character A in the text to be regularized, X and Y both being integers greater than or equal to 1. The character vectors of the characters in the first text are then spliced (that is, concatenated horizontally) to obtain the first feature vector of character A, which represents the context information of character A in the first text.
It should be understood that if fewer than X characters precede character A or fewer than Y characters follow it, for example when character A is the first or last character of the text to be regularized, the first text for character A can be constructed by filling in a preset character (for example, the start symbol S).
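The window construction and splicing in step 103 can be sketched as follows. This is a simplified illustration under stated assumptions: the padding token `"<S>"` stands in for the start symbol S and is also used on the right-hand side, the character vectors are given as a plain dictionary of toy values, and the function names are hypothetical.

```python
PAD = "<S>"  # assumed padding token standing in for the start symbol S


def context_window(tokens: list[str], index: int, x: int, y: int, pad: str = PAD) -> list[str]:
    """Return X tokens before tokens[index], the token itself, and Y tokens after it, padded as needed."""
    before = tokens[max(0, index - x):index]
    after = tokens[index + 1:index + 1 + y]
    before = [pad] * (x - len(before)) + before  # left-pad near the start of the text
    after = after + [pad] * (y - len(after))     # right-pad near the end of the text
    return before + [tokens[index]] + after


def first_feature_vector(window: list[str], char_vectors: dict[str, list[float]]) -> list[float]:
    """Concatenate (splice) the character vectors of every token in the window."""
    spliced: list[float] = []
    for token in window:
        spliced.extend(char_vectors[token])
    return spliced


if __name__ == "__main__":
    tokens = ["pay", "$", "5", "now"]
    # Toy 2-dimensional character vectors; real ones would come from the letter-level encoder.
    vectors = {"pay": [0.1, 0.2], "$": [0.3, 0.4], "5": [0.5, 0.6], "now": [0.7, 0.8], PAD: [0.0, 0.0]}
    window = context_window(tokens, index=0, x=2, y=1)
    print(window, first_feature_vector(window, vectors))
```

With a window of X + 1 + Y tokens and d-dimensional character vectors, the spliced vector has (X + 1 + Y) * d entries.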
104: The text regularization apparatus performs regularization processing on the text to be regularized according to the first feature vector of each of the plurality of characters and the language type of the text to be regularized, to obtain the regular text of the text to be regularized.
For example, the attribute of each of the plurality of characters is determined according to the first feature vector of that character, where the attribute of each character is either standard character or non-standard character. Each standard character in the text to be regularized is then used as its own regular character, and each non-standard character is encoded and decoded according to the language type of the text to be regularized and the first feature vector corresponding to that non-standard character, to obtain the regular character of the non-standard character. Finally, the regular characters of the standard characters and the regular characters of the non-standard characters in the text to be regularized are combined to obtain the regular text of the text to be regularized.
For example, a standard character is a character whose pronunciation matches its written form. The character "year", for instance, is both written and read as "year", so the regular character of a standard character is the character itself. The non-standard characters involved in this application include but are not limited to the following:
dates, currencies, addresses, letters, cardinal numbers, ordinal numbers, web addresses, units of measurement, fractions, decimals, telephone numbers, times, digits, punctuation, and foreign words.
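The control flow of step 104 (keep standard characters, rewrite non-standard ones, then recombine) can be sketched as below. The classifier and the decoder are passed in as callables, and the toy versions shown here (an `isalpha` check and a lookup table) are placeholders invented for this example; in the application the attribute is determined from the first feature vector and the rewriting is done by a neural encoder and decoder.

```python
from typing import Callable


def regularize(
    tokens: list[str],
    is_standard: Callable[[str], bool],
    decode_non_standard: Callable[[str, str], str],
    language: str,
) -> str:
    """Keep standard tokens as they are, rewrite non-standard ones, and join the results."""
    regular_tokens = []
    for token in tokens:
        if is_standard(token):
            regular_tokens.append(token)  # a standard character is its own regular character
        else:
            regular_tokens.append(decode_non_standard(token, language))
    return " ".join(regular_tokens)


# Toy stand-ins (assumptions for illustration only).
TOY_TABLE = {("en", "5"): "five", ("en", "$"): "dollars"}


def toy_is_standard(token: str) -> bool:
    return token.isalpha()


def toy_decode(token: str, language: str) -> str:
    return TOY_TABLE.get((language, token), token)


if __name__ == "__main__":
    print(regularize(["pay", "5", "$"], toy_is_standard, toy_decode, "en"))
```

Passing the language type through to the decoder mirrors the way the application conditions the encoder and decoder on the language of the text.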
Further, a second text corresponding to character B is constructed with character B as the center, where the second text includes the M characters located before character B in the text to be regularized, character B itself, and the N characters located after character B in the text to be regularized; character B is any non-standard character in the text to be regularized, and M and N are both integers greater than or equal to 1. Each character in the second text is then encoded through byte-pair encoding (BPE) to obtain a second feature vector of each character in the second text.
Specifically, each character in the second text is split into a letter string; the letter strings of each character are merged according to the frequency with which letter sequences occur across all characters in the second text, yielding a new letter string for each character; the new letter string of each character is then input into the encoder for encoding to obtain the second feature vector of each character in the second text. Byte-pair encoding addresses the problem of out-of-vocabulary words in the second text. Next, the second feature vector of each character in the second text is input into a Transformer-XL network for feature extraction to obtain a third feature vector corresponding to character B, where the third feature vector of character B represents the context information of character B in the second text. Finally, character B is encoded and decoded according to the first feature vector of character B, the third feature vector of character B, and the language type of the text to be regularized, to obtain the regular character corresponding to character B. The encoding and decoding of character B is described in detail later and is not elaborated here.
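The frequency-based merging of letter strings described above corresponds to the merge loop of byte-pair encoding. The sketch below shows that loop on a handful of tokens; it is a minimal illustration, the function names are hypothetical, and it learns merges from the window itself rather than from a training corpus as a production BPE vocabulary would.

```python
from collections import Counter


def most_frequent_pair(words: list[list[str]]):
    """Count adjacent symbol pairs over all words and return the most frequent one (or None)."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def bpe(tokens: list[str], num_merges: int) -> list[list[str]]:
    """Split each token into letters, then apply `num_merges` most-frequent-pair merges."""
    words = [list(token) for token in tokens]
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = [merge_pair(symbols, pair) for symbols in words]
    return words


if __name__ == "__main__":
    print(bpe(["low", "lower", "lowest"], num_merges=2))
```

Because rare or unseen tokens still decompose into known sub-strings and single letters, the encoder never meets a symbol it has no representation for, which is the out-of-vocabulary benefit the text refers to.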
It can be seen that, in the embodiments of the present application, the text to be regularized is first segmented into characters; each character is then encoded to obtain its first feature vector; finally, the text to be regularized is regularized according to the first feature vector of each character and the language type of the text to be regularized. The text can thus be regularized without manually written regularization rules, which improves the efficiency of text regularization and saves labor costs. In addition, because the language type of the text to be regularized is taken into account during regularization, text in various languages can be regularized, so the text regularization method of the present application is applicable to a wider range of scenarios.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of an encoding and decoding method provided by an embodiment of the present application. The method is applied to a text regularization apparatus and includes the following steps.
201: Perform word embedding on character B to obtain a fourth feature vector of character B.
For example, performing word embedding on character B amounts to mapping character B to a vector, which is the fourth feature vector of character B. For instance, the ASCII code of character B may be used as the fourth feature vector of character B.
202: Encode the attribute of character B to obtain a part-of-speech vector of character B.
For example, encoding the attribute of character B means mapping the word class to which character B belongs to obtain the part-of-speech vector of character B. For instance, if character B belongs to the class "货币" (currency), the GB2312 code of "货币" is used as the part-of-speech vector of character B.
It should be understood that although the attribute of character B has already been obtained by classifying the first feature vector of each character, that classification only determines whether a character is a standard character or a non-standard character and does not subdivide the non-standard characters. The first feature vector of each character can therefore only distinguish standard characters from non-standard characters and cannot differentiate further among non-standard characters. Here, character B is classified at a finer granularity and the part-of-speech vector of each non-standard character is then mapped, so as to obtain a finer-grained class for each non-standard character.
203: Encode the language type of the text to be regularized to obtain a language vector of the text to be regularized, and use the language vector as both the encoding parameter of the encoder and the decoding parameter of the decoder.
Similarly, the language type of the text to be regularized is mapped to obtain the language vector of the text to be regularized. For example, the GB2312 code of the Chinese name of the language type (for example, "英语" (English), "中文" (Chinese), or "法语" (French)) may be used as the language vector of that language type.
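The three mappings in steps 201 to 203 can be illustrated with Python's built-in codecs. This is only a sketch of the code-point examples given above: the helper names are hypothetical, and a practical system would more likely use learned embeddings than raw code values.

```python
def ascii_embedding(token: str) -> list:
    """Fourth feature vector (illustrative): the ASCII codes of the token."""
    return list(token.encode("ascii"))


def gb2312_vector(label: str) -> list:
    """Part-of-speech or language vector (illustrative): GB2312 bytes of a Chinese label."""
    return list(label.encode("gb2312"))


fourth_vec = ascii_embedding("$")    # character B
pos_vec = gb2312_vector("货币")       # word class "currency"
lang_vec = gb2312_vector("英语")      # language type "English"
print(fourth_vec, pos_vec, lang_vec)
```

Each Chinese character occupies two bytes in GB2312, so a two-character label yields a four-element vector.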
204: Input the fourth feature vector of character B into the encoder and encode character B to obtain a fifth feature vector of character B.
For example, the encoder may be a neural network built on a long short-term memory (LSTM) network, a bidirectional LSTM network, or a recurrent network. The present application does not limit the type of encoder.
For example, character B is encoded according to the hidden-layer vector output by the encoder's previous encoding step, the fourth feature vector of character B, and the encoding parameter of the encoder (that is, the language vector), to obtain the fifth feature vector corresponding to character B and the hidden-layer vector corresponding to character B.
It should be understood that when character B is the first non-standard character to be encoded, the hidden-layer vector output by the encoder's previous encoding step is a preset hidden-layer vector, for example a zero vector. In addition, if only character B is encoded in the current pass, the hidden-layer vector finally output by the encoder is the one generated while encoding character B; if other non-standard characters also need to be encoded, the hidden-layer vector corresponding to character B is used as the input hidden-layer vector when encoding the next non-standard character.
It should be understood that if character B is one of multiple consecutive non-standard characters in the text to be regularized that share the same attribute (that is, exactly the same word class; for example, all are dates among the non-standard words), these non-standard characters may be encoded together rather than individually, which improves both the efficiency and the accuracy of encoding them.
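Collecting consecutive non-standard characters of the same class into one run can be sketched with `itertools.groupby`. The input format, a list of `(token, cls)` pairs with `cls` set to `None` for standard characters, and the class label `"money"` are assumptions made for illustration.

```python
from itertools import groupby


def group_runs(tagged):
    """Group consecutive non-standard tokens that share a class into one run."""
    runs = []
    for cls, items in groupby(tagged, key=lambda pair: pair[1]):
        toks = [tok for tok, _ in items]
        if cls is None:
            # Standard characters are kept one per entry and need no encoding.
            runs.extend(([tok], None) for tok in toks)
        else:
            runs.append((toks, cls))
    return runs


tagged = [("about", None), ("$", "money"), ("1", "money"),
          ("billion", "money"), ("during", None)]
print(group_runs(tagged))
```

Each run with a non-`None` class would then be passed to the encoder as a whole.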
For example, as shown in FIG. 3, the text to be regularized contains multiple consecutive non-standard characters with the same attribute, [X_1, X_2, …, X_n]. Word embedding is performed on each non-standard character in [X_1, X_2, …, X_n] to obtain the fourth feature vector of each non-standard character. Then, based on the preset hidden-layer vector e_0 and the encoding parameter of the encoder, the first non-standard character X_1 is encoded in a first encoding step to obtain the fifth feature vector Y_1 of X_1 and the hidden-layer vector e_1 corresponding to the first encoding step. Further, based on the hidden-layer vector e_1 output by the first encoding step and the encoding parameter of the encoder, the second non-standard character X_2 is encoded in a second encoding step to obtain the fifth feature vector Y_2 of X_2 and the hidden-layer vector e_2 corresponding to the second encoding step. These steps are repeated until the fifth feature vector Y_n of the last non-standard character X_n and the hidden-layer vector e_n output by the last encoding step are obtained. The hidden-layer vector output by the last encoding step contains the contextual semantic information of these non-standard characters. In this way, the non-standard characters [X_1, X_2, …, X_n] are encoded in succession, and the corresponding fifth feature vectors [Y_1, Y_2, …, Y_n] are output.
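The chained encoding of X_1 to X_n can be sketched with a plain recurrent cell. This is a sketch under stated assumptions: the application allows an LSTM, bidirectional LSTM, or other recurrent encoder and does not say how the language vector enters the computation, so the additive term `W_l · lang`, the tanh cell, and all weight names are hypothetical. In this simple cell the output Y_t coincides with the hidden-layer vector e_t.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def rnn_step(x, h_prev, lang, W_x, W_h, W_l):
    """One encoding step: h = tanh(W_x x + W_h h_prev + W_l lang)."""
    pre = [a + b + c for a, b, c in
           zip(matvec(W_x, x), matvec(W_h, h_prev), matvec(W_l, lang))]
    return [math.tanh(p) for p in pre]


def encode_run(xs, lang, W_x, W_h, W_l, hidden_size):
    h = [0.0] * hidden_size      # e_0: preset hidden-layer vector (zero vector)
    outputs = []
    for x in xs:                 # X_1 ... X_n, in order
        h = rnn_step(x, h, lang, W_x, W_h, W_l)
        outputs.append(h)        # Y_t (equal to e_t in this simple cell)
    return outputs, h            # ([Y_1 ... Y_n], e_n)


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


hidden, emb, lang_dim = 4, 3, 2
W_x, W_h, W_l = rand_matrix(hidden, emb), rand_matrix(hidden, hidden), rand_matrix(hidden, lang_dim)
# Random stand-ins for the fourth feature vectors of three non-standard characters.
xs = [[rng.uniform(-1, 1) for _ in range(emb)] for _ in range(3)]
ys, e_n = encode_run(xs, [1.0, 0.0], W_x, W_h, W_l, hidden)
print(len(ys), e_n)
```

The returned `e_n` is what the decoder would use as its initial hidden-layer vector.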
For example, if the text to be regularized is "Achieve record net income of about $1 billion during the year", the non-standard characters are identified as "$", "1", and "billion", and these three non-standard characters are consecutive and share the same attribute. The three non-standard characters can therefore be encoded in succession, and their fifth feature vectors are output together with the hidden-layer vector obtained in the encoder's last encoding step. Specifically, word embedding is first performed on the characters "$", "1", and "billion" to obtain the fourth feature vector of each non-standard character, and the fourth feature vectors of the three characters are used as the input of the encoder. The encoder first encodes the character "$" based on the initial hidden-layer vector (that is, the zero vector) and the fourth feature vector of "$", obtaining the fifth feature vector of "$" and the hidden-layer vector of the first encoding step. The encoder then encodes the character "1" based on the hidden-layer vector of the first encoding step and the fourth feature vector of "1", obtaining the fifth feature vector of "1" and the hidden-layer vector of the second encoding step. Finally, the encoder encodes the character "billion" based on the hidden-layer vector of the second encoding step and the fourth feature vector of "billion", obtaining the fifth feature vector of "billion" and the final hidden-layer vector. The final hidden-layer vector contains the full semantic information of the three non-standard characters.
205: Input the part-of-speech vector of character B and the fifth feature vector of character B into the decoder and decode character B to obtain the regularized text of character B.
For example, the decoder may be a neural network built on a long short-term memory (LSTM) network, a bidirectional LSTM network, or a recurrent network. The present application does not limit the type of decoder.
For example, an attention operation is performed on the hidden-layer vector output by the decoder's previous decoding step and the fifth feature vector corresponding to character B to obtain a sixth feature vector corresponding to character B. The attention mechanism may be a general-purpose attention operation. For instance, the fifth feature vector corresponding to character B may serve as the key-value pair, that is, as the key vector and the value vector, and the hidden-layer vector output by the decoder's previous decoding step may serve as the query vector; the attention operation is then performed to obtain the sixth feature vector corresponding to character B. The attention operations mentioned later are similar and are not described again.
It should be understood that if character B is the first character to be decoded, the hidden-layer vector output by the decoder's previous decoding step is the hidden-layer vector output by the encoder's last encoding step; if character B is not the first character to be decoded, it is the hidden-layer vector generated when the decoder decoded the previous character. Because the hidden-layer vector output by the previous decoding step (for example, the hidden-layer vector output by the encoder's last encoding step) contains the contextual semantic information of character B, the attention operation retains the information most relevant to the current decoding step and improves decoding accuracy.
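The attention operation described above, with the fifth feature vectors as keys and values and the previous decoder hidden-layer vector as the query, can be sketched as follows. The scoring function is an assumption: the application only refers to a general-purpose attention mechanism, and unscaled dot-product scoring is used here as one common choice.

```python
import math


def attention(query, keys, values):
    """Dot-product attention: returns (context vector, attention weights)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return context, weights


# Toy fifth feature vectors [Y_1, Y_2, Y_3] and a toy previous hidden-layer vector.
fifth_vectors = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
prev_hidden = [1.0, 0.0]
sixth_vector, weights = attention(prev_hidden, fifth_vectors, fifth_vectors)
print(sixth_vector, weights)
```

The context vector plays the role of the sixth feature vector; the weights show which input character the current decoding step attends to most.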
Further, the part-of-speech vector of character B, the third feature vector of character B, the sixth feature vector of character B, and the decoding result of the decoder's previous decoding step are concatenated to obtain a target feature vector of character B. Character B is then decoded according to the decoding parameter of the decoder (the language vector) and the target feature vector of character B to obtain the regularized character corresponding to character B. That is, the decoding parameter of the decoder is applied to the target feature vector to obtain the probability of each character in the standard dictionary, and the standard character with the highest probability is taken as the regularized character of character B.
Here, the decoding result of the decoder's previous decoding step is the result generated when the decoder decoded the previous character (that is, the feature vector of the regularized character of the previous character). It should be understood that if character B is the first character to be decoded, the previous decoding result is the feature vector of a preset character. For example, the preset character is a start symbol S, and the feature vector of the start symbol S is used in the concatenation to indicate the start of decoding.
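Building the target feature vector and selecting the most probable dictionary entry can be sketched as below. Everything beyond the concatenation and the argmax is an assumption: the single projection matrix `w_out` stands in for the decoder network, the way the language vector parameterizes the decoder is not modeled, and the four-entry dictionary is a toy.

```python
import math


def decode_step(pos_vec, third_vec, sixth_vec, prev_vec, w_out, dictionary):
    """Concatenate the four inputs, project to dictionary scores, pick the argmax."""
    target = pos_vec + third_vec + sixth_vec + prev_vec   # target feature vector
    logits = [sum(w * t for w, t in zip(row, target)) for row in w_out]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]                     # probability per dictionary entry
    best = max(range(len(probs)), key=probs.__getitem__)
    return dictionary[best], probs


dictionary = ["one", "billion", "dollars", "<end>"]
# Identity projection so that the result is easy to follow by hand.
w_out = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
token, probs = decode_step([0.0], [0.0], [5.0], [0.0], w_out, dictionary)
print(token)
```

For the first decoding step, `prev_vec` would be the feature vector of the start symbol S.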
Similarly, if character B is one of multiple consecutive non-standard characters in the text to be regularized that share the same attribute (that is, exactly the same word class; for example, all are dates among the non-standard words), these non-standard characters are decoded one after another according to their fifth feature vectors rather than decoding any one of them in isolation, which improves both the efficiency and the accuracy of decoding.
For example, as shown in FIG. 3, an attention operation is performed on the hidden-layer vector e_n output by the encoder's last encoding step and the fifth feature vectors [Y_1, Y_2, …, Y_n] of the non-standard characters [X_1, X_2, …, X_n] to obtain a sixth feature vector. It should be understood that because the hidden-layer vector output by the encoder's last encoding step contains the full semantic information of the non-standard characters [X_1, X_2, …, X_n], the attention operation focuses decoding on the first character to be decoded, which improves decoding accuracy. The sixth feature vector, the part-of-speech vector L of the non-standard characters, the third feature vector H of the non-standard characters, and the feature vector of the preset symbol (not shown in FIG. 3) are then concatenated to obtain the target feature vector of the first non-standard character to be decoded. Because the non-standard characters share the same attribute, their part-of-speech vector may be the part-of-speech vector of any one of them, and their third feature vector is the average of the third feature vectors of the individual non-standard characters. Finally, the first non-standard character to be decoded is decoded based on its target feature vector to obtain the first decoding result Z_1 (that is, the regularized character of the first non-standard character to be decoded) and the hidden-layer vector d_1 of the first decoding step. The second decoding step then uses the hidden-layer vector d_1, the first decoding result Z_1, the part-of-speech vector L, the third feature vector H, and the fifth feature vectors [Y_1, Y_2, …, Y_n] to obtain the second decoding result Z_2 (that is, the regularized character of the second non-standard character to be decoded) and the hidden-layer vector of the second decoding step. These steps are repeated until the regularized characters [Z_1, Z_2, …, Z_n] of all the non-standard characters [X_1, X_2, …, X_n] have been decoded, at which point decoding stops.
For example, the decoding process is illustrated using the non-standard characters "$1 billion". In the first decoding step, an attention operation is performed on the hidden-layer vector output by the encoder's last encoding step and the fifth feature vectors of the three non-standard characters to obtain a sixth feature vector (because the character "1" is regularized first, this sixth feature vector focuses on the character "1"). The sixth feature vector, the part-of-speech vector (the three non-standard characters share the same part-of-speech vector), the third feature vector (obtained by averaging the third feature vectors of the individual non-standard characters), and the feature vector of the start symbol S are concatenated to obtain a target feature vector. The first decoding step is performed on this target feature vector and yields the vector of the character "1" (after mapping, the regularized character of "1" is "one") and the hidden-layer vector corresponding to the character "1". In the second decoding step, an attention operation is performed on the hidden-layer vector output by the first decoding step and the fifth feature vectors of the three characters to obtain a sixth feature vector; this sixth feature vector, the part-of-speech vector, the third feature vector, and the vector of the character "1" output by the first decoding step are concatenated to obtain a target feature vector, which is input into the decoder and decoded to obtain the vector of the character "billion" (after mapping, the regularized character of "billion" is "billion") and the hidden-layer vector of the second decoding step. In the third decoding step, an attention operation is performed on the hidden-layer vector of the second decoding step and the fifth feature vectors of the three characters to obtain a sixth feature vector; this sixth feature vector, the part-of-speech vector, the third feature vector, and the vector of the character "billion" output by the second decoding step (the regularized character of "billion") are concatenated to obtain a target feature vector, which is input into the decoder and decoded to obtain the vector of the character "$" (after mapping, the regularized character of "$" is "dollars") and a hidden-layer vector of the decoder. Finally, an attention operation is performed on the hidden-layer vector output by the third decoding step and the fifth feature vectors of the three characters to obtain a sixth feature vector; this sixth feature vector, the part-of-speech vector, the third feature vector, and the vector of the character "$" output by the third decoding step are concatenated to obtain a target feature vector, which is input into the decoder and decoded into the end symbol "end", indicating that decoding stops.
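The step-by-step decoding above follows a greedy loop that feeds each decoding result back in and stops at the end symbol. The sketch below replaces the real decoder with a lookup table so that the control flow is visible; `step_fn`, the symbol names `<S>` and `<end>`, and the `max_steps` safety limit are all assumptions for illustration.

```python
def greedy_decode(step_fn, init_hidden, start_symbol="<S>", end_symbol="<end>", max_steps=20):
    """Decode one output at a time until the end symbol is produced."""
    hidden, prev, outputs = init_hidden, start_symbol, []
    for _ in range(max_steps):          # assumption: hard cap to avoid endless decoding
        token, hidden = step_fn(prev, hidden)
        if token == end_symbol:
            break
        outputs.append(token)
        prev = token                    # previous decoding result feeds the next step
    return outputs


# Toy stand-in for attention + concatenation + decoder on the "$1 billion" example.
TOY_TRANSITIONS = {"<S>": "one", "one": "billion", "billion": "dollars", "dollars": "<end>"}


def toy_step(prev, hidden):
    return TOY_TRANSITIONS[prev], hidden


print(greedy_decode(toy_step, init_hidden=None))
```

In a real system `init_hidden` would be the encoder's final hidden-layer vector and `step_fn` would perform the attention, concatenation, and decoding described above.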
Therefore, through the encoding and decoding process described above, the three consecutive non-standard characters "$1 billion" can be regularized in one pass into "one billion dollars", and the text to be regularized is thereby regularized as "Achieve record net income of about one billion dollars during the year".
It can be seen that, in the embodiments of the present application, an attention mechanism is used when encoding and decoding non-standard characters, which improves the accuracy of each encoding and decoding step. In addition, multiple consecutive non-standard characters with the same attribute can be encoded and decoded together, and information is shared among them during encoding and decoding, which improves both efficiency and accuracy.
Referring to FIG. 4, FIG. 4 is a block diagram of the functional units of a text regularization apparatus provided by an embodiment of the present application. The text regularization apparatus 400 includes an acquisition unit 401 and a processing unit 402, where: the acquisition unit 401 is configured to acquire a text to be regularized; and the processing unit 402 is configured to perform character segmentation on the text to be regularized to obtain a plurality of characters; encode each of the plurality of characters to obtain a first feature vector of each of the plurality of characters, where the first feature vector of each character represents the context information of that character; and regularize the text to be regularized according to the first feature vector of each of the plurality of characters and the language type of the text to be regularized, to obtain the regularized text of the text to be regularized.
In some possible implementations, in terms of encoding each of the plurality of characters to obtain the first feature vector of each of the plurality of characters, the processing unit 402 is specifically configured to: encode each of the plurality of characters to obtain a character vector corresponding to each of the plurality of characters; construct, with character A at its center, a first text corresponding to character A, where the first text includes the X characters preceding character A in the text to be regularized, character A, and the Y characters following character A in the text to be regularized, character A is any one of the plurality of characters, and X and Y are both integers greater than or equal to 1; and concatenate the character vectors corresponding to the characters in the first text to obtain the first feature vector of character A, where the first feature vector of character A represents the context information of character A in the first text.
In some possible implementations, in terms of regularizing the text to be regularized according to the first feature vector of each of the plurality of characters and the language type of the text to be regularized to obtain the regularized text of the text to be regularized, the processing unit 402 is specifically configured to: determine the attribute of each of the plurality of characters according to the first feature vector of each of the plurality of characters, where the attribute of each character is either standard character or non-standard character; use each standard character in the text to be regularized as the regularized character of that standard character; encode and decode each non-standard character according to the language type and the first feature vector corresponding to that non-standard character in the text to be regularized, to obtain the regularized character of the non-standard character; and combine the regularized characters of the standard characters with the regularized characters of the non-standard characters in the text to be regularized, to obtain the regularized text of the text to be regularized.
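The overall flow performed by the processing unit (classify each character, pass standard characters through, regularize non-standard ones, and recombine) can be sketched as follows. The classifier and the encoder-decoder are replaced by toy stand-ins; `is_standard`, `regularize_run`, the token set, and the lookup table are hypothetical, and for brevity this sketch groups consecutive non-standard characters without checking that they share a class.

```python
def regularize(tokens, is_standard, regularize_run):
    """Keep standard tokens as-is and replace each run of non-standard tokens."""
    out, run = [], []
    for tok in tokens:
        if is_standard(tok):
            if run:                              # close the pending non-standard run
                out.extend(regularize_run(run))
                run = []
            out.append(tok)                      # a standard token is its own regularized form
        else:
            run.append(tok)
    if run:
        out.extend(regularize_run(run))
    return " ".join(out)


# Toy stand-ins for the attribute classifier and the encoder-decoder.
NON_STANDARD = {"$", "1", "billion"}
TOY_TABLE = {("$", "1", "billion"): ["one", "billion", "dollars"]}

tokens = "Achieve record net income of about $ 1 billion during the year".split()
result = regularize(tokens,
                    is_standard=lambda t: t not in NON_STANDARD,
                    regularize_run=lambda run: TOY_TABLE[tuple(run)])
print(result)
```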
In some possible implementations, in terms of encoding and decoding the non-standard character according to the language type and the first feature vector corresponding to the non-standard character in the text to be regularized to obtain the regularized character of the non-standard character, the processing unit 402 is specifically configured to: construct, with character B at its center, a second text corresponding to character B, where the second text includes the M characters preceding character B in the text to be regularized, character B, and the N characters following character B in the text to be regularized, character B is any non-standard character in the text to be regularized, and M and N are both integers greater than or equal to 1; encode each character in the second text using byte-pair encoding to obtain a second feature vector of each character in the second text; input the second feature vector of each character in the second text into a Transformer-XL network to obtain a third feature vector corresponding to character B, where the third feature vector of character B represents the context information of character B in the second text; and encode and decode character B according to the attribute of character B, the third feature vector of character B, and the language type, to obtain the regularized character corresponding to character B.
In some possible implementations, in terms of encoding and decoding character B according to the attribute of character B, the third feature vector of character B, and the language type to obtain the regularized character corresponding to character B, the processing unit 402 is specifically configured to: perform word embedding on character B to obtain a fourth feature vector of character B; encode the attribute of character B to obtain a part-of-speech vector corresponding to character B; encode the language type to obtain a language vector, and use the language vector as both the encoding parameter of the encoder and the decoding parameter of the decoder; input the fourth feature vector of character B into the encoder and encode character B to obtain a fifth feature vector of character B; and input the part-of-speech vector of character B and the fifth feature vector of character B into the decoder and decode character B to obtain the regularized text corresponding to character B.
In some possible implementations, in terms of inputting the fourth feature vector of character B into the encoder for encoding to obtain the fifth feature vector of character B, the processing unit 402 is specifically configured to: encode character B according to the hidden-layer vector output by the encoder's previous encoding step, the fourth feature vector of character B, and the encoding parameter of the encoder, to obtain the fifth feature vector of character B.
In some possible implementations, in terms of inputting the part-of-speech vector of character B and the fifth feature vector of character B into the decoder and decoding character B to obtain the regularized text corresponding to character B, the processing unit 402 is specifically configured to: perform an attention operation on the hidden-layer vector output by the decoder's previous decoding step and the fifth feature vector corresponding to character B to obtain a sixth feature vector corresponding to character B; concatenate the part-of-speech vector of character B, the third feature vector of character B, the sixth feature vector of character B, and the decoding result of the decoder's previous decoding step to obtain a target feature vector of character B; and decode character B according to the decoding parameter of the decoder and the target feature vector of character B, to obtain the regularized character corresponding to character B.
Referring to FIG. 5, FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of this application. The electronic device includes a processor and a memory. Optionally, the electronic device may further include a transceiver. For example, as shown in FIG. 5, the electronic device 500 includes a transceiver 501, a processor 502, and a memory 503, which are connected through a bus 504. The memory 503 is configured to store a computer program and data, and can transmit the stored data to the processor 502.
The processor 502 is configured to read the computer program in the memory 503 and perform the following operations: control the transceiver 501 to obtain a text to be regularized; perform character segmentation on the text to be regularized to obtain a plurality of characters; encode each of the plurality of characters to obtain a first feature vector of each character, where the first feature vector of each character represents context information of that character; and regularize the text to be regularized according to the first feature vector of each of the plurality of characters and the language type of the text to be regularized, to obtain a regularized text of the text to be regularized.
In some possible implementations, in terms of encoding each of the plurality of characters to obtain the first feature vector of each character, the processor 502 is specifically configured to perform the following operations: encode each of the plurality of characters to obtain a character vector corresponding to each character; construct, centered on a character A, a first text corresponding to the character A, the first text including the X characters preceding the character A in the text to be regularized, the character A, and the Y characters following the character A in the text to be regularized, where the character A is any one of the plurality of characters and X and Y are both integers greater than or equal to 1; and concatenate the character vectors corresponding to the characters in the first text to obtain the first feature vector of the character A, the first feature vector of the character A representing context information of the character A in the first text.
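By way of a non-limiting illustration, the window construction and concatenation described above may be sketched as follows. The function names, the `PAD` symbol, and the use of zero vectors for positions that fall outside the text are assumptions made for this sketch only; the description does not specify how characters near the start or end of the text are handled.

```python
from typing import Dict, List

PAD = "<pad>"  # hypothetical padding symbol; edge handling is an assumption of this sketch


def build_first_text(chars: List[str], idx: int, x: int, y: int) -> List[str]:
    """Return the X characters before chars[idx], chars[idx] itself, and the Y characters after it."""
    left = [chars[i] if i >= 0 else PAD for i in range(idx - x, idx)]
    right = [chars[i] if i < len(chars) else PAD for i in range(idx + 1, idx + 1 + y)]
    return left + [chars[idx]] + right


def first_feature_vector(
    chars: List[str],
    idx: int,
    x: int,
    y: int,
    char_vectors: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Concatenate the character vectors of every character in the first text."""
    zero = [0.0] * dim  # used for padding and for characters without a vector
    vec: List[float] = []
    for ch in build_first_text(chars, idx, x, y):
        vec.extend(char_vectors.get(ch, zero))
    return vec
```

With X = Y = 1 and two-dimensional character vectors, the first feature vector of each character has 3 × 2 = 6 components.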
In some possible implementations, in terms of regularizing the text to be regularized according to the first feature vector of each of the plurality of characters and the language type of the text to be regularized to obtain the regularized text, the processor 502 is specifically configured to perform the following operations: determine an attribute of each of the plurality of characters according to the first feature vector of that character, the attribute of each character being either standard character or non-standard character; use each standard character in the text to be regularized as the regularized character of that standard character; encode and decode the non-standard characters according to the language type and the first feature vectors corresponding to the non-standard characters in the text to be regularized, to obtain regularized characters of the non-standard characters; and combine the regularized characters of the standard characters with the regularized characters of the non-standard characters to obtain the regularized text of the text to be regularized.
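By way of a non-limiting illustration, the overall flow described above (classify each character, pass standard characters through unchanged, rewrite non-standard characters, and combine the results in order) may be sketched as follows. The callbacks `toy_is_standard` and `toy_rewrite` are hypothetical stand-ins: in the described embodiments the attribute is determined from the first feature vector and the rewriting is performed by the encoder and decoder, neither of which is reproduced here.

```python
from typing import Callable, List

# Toy lookup used only by the stand-in rewriter below.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def toy_is_standard(chars: List[str], i: int) -> bool:
    """Stand-in classifier: treat digits as non-standard characters."""
    return not chars[i].isdigit()


def toy_rewrite(chars: List[str], i: int) -> str:
    """Stand-in for the encode/decode step: spell out a single digit."""
    return DIGIT_WORDS[chars[i]]


def regularize(
    chars: List[str],
    is_standard: Callable[[List[str], int], bool],
    rewrite: Callable[[List[str], int], str],
) -> str:
    """Keep standard characters as they are, rewrite the others, and join in order."""
    out: List[str] = []
    for i, ch in enumerate(chars):
        out.append(ch if is_standard(chars, i) else rewrite(chars, i))
    return "".join(out)
```

For example, `regularize(list("room 2"), toy_is_standard, toy_rewrite)` leaves the letters and the space untouched and replaces only the digit.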
In some possible implementations, in terms of encoding and decoding the non-standard characters according to the language type and the first feature vectors corresponding to the non-standard characters in the text to be regularized to obtain the regularized characters of the non-standard characters, the processor 502 is specifically configured to perform the following operations: construct, centered on a character B, a second text corresponding to the character B, the second text including the M characters preceding the character B in the text to be regularized, the character B, and the N characters following the character B in the text to be regularized, where the character B is any non-standard character in the text to be regularized and M and N are both integers greater than or equal to 1; encode each character in the second text by double-byte encoding to obtain a second feature vector of each character in the second text; input the second feature vector of each character in the second text into a Transformer-XL network to obtain a third feature vector corresponding to the character B, the third feature vector of the character B representing context information of the character B in the second text; and encode and decode the character B according to the attribute of the character B, the third feature vector of the character B, and the language type, to obtain the regularized character corresponding to the character B.
In some possible implementations, in terms of encoding and decoding the character B according to the attribute of the character B, the third feature vector of the character B, and the language type to obtain the regularized character corresponding to the character B, the processor 502 is specifically configured to perform the following operations: perform word embedding on the character B to obtain a fourth feature vector of the character B; encode the attribute of the character B to obtain a part-of-speech vector corresponding to the character B; encode the language type to obtain a language vector, and use the language vector as both the encoding parameter of an encoder and the decoding parameter of a decoder; input the fourth feature vector of the character B into the encoder to encode the character B and obtain a fifth feature vector of the character B; and input the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder to decode the character B and obtain the regularized text corresponding to the character B.
In some possible implementations, in terms of inputting the fourth feature vector of the character B into the encoder for encoding to obtain the fifth feature vector of the character B, the processor 502 is specifically configured to perform the following operation: encode the character B according to the hidden-layer vector output by the encoder in its previous encoding step, the fourth feature vector of the character B, and the encoding parameter of the encoder, to obtain the fifth feature vector of the character B.
In some possible implementations, in terms of inputting the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder to decode the character B and obtain the regularized text corresponding to the character B, the processor 502 is specifically configured to perform the following operations: perform an attention operation on the hidden-layer vector output by the decoder in its previous decoding step and the fifth feature vector corresponding to the character B, to obtain a sixth feature vector corresponding to the character B; concatenate the part-of-speech vector of the character B, the third feature vector of the character B, the sixth feature vector of the character B, and the decoding result of the decoder's previous decoding step, to obtain a target feature vector of the character B; and decode the character B according to the encoding parameter of the encoder and the target feature vector of the character B, to obtain the regularized character corresponding to the character B.
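By way of a non-limiting illustration, the attention operation and the concatenation that forms the target feature vector may be sketched as follows. The description does not fix the scoring function, so this sketch assumes plain dot-product attention with a softmax over a list of encoder output vectors (one per encoding step); the function names and the representation of the previous decoding result as a vector are likewise assumptions of the sketch.

```python
import math
from typing import List


def attention_context(prev_hidden: List[float], encoder_states: List[List[float]]) -> List[float]:
    """Dot-product attention (assumed): weight encoder states by similarity to prev_hidden."""
    scores = [sum(h * e for h, e in zip(prev_hidden, state)) for state in encoder_states]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [v / total for v in exps]
    dim = len(encoder_states[0])
    # Weighted sum of the encoder states, playing the role of the sixth feature vector.
    return [sum(w * state[d] for w, state in zip(weights, encoder_states)) for d in range(dim)]


def target_feature_vector(
    pos_vec: List[float],
    third_vec: List[float],
    sixth_vec: List[float],
    prev_output_vec: List[float],
) -> List[float]:
    """Concatenate the four inputs in the order listed in the description."""
    return pos_vec + third_vec + sixth_vec + prev_output_vec
```

The length of the target feature vector is the sum of the lengths of its four parts, so the decoder input size depends on the dimensions chosen for each of them.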
Specifically, the transceiver 501 may be the acquisition unit 401 of the text regularization apparatus 400 in the embodiment shown in FIG. 4, and the processor 502 may be the processing unit 402 of the text regularization apparatus 400 in the embodiment shown in FIG. 4.
It should be understood that the text regularization apparatus in this application may include a smartphone (such as an Android phone, an iOS phone, or a Windows Phone device), a tablet computer, a palmtop computer, a notebook computer, a mobile Internet device (MID), a wearable device, or the like. These are examples rather than an exhaustive list, and the text regularization apparatus includes but is not limited to them. In practical applications, the text regularization apparatus may also include an intelligent vehicle-mounted terminal, a computer device, and the like.
An embodiment of this application further provides a computer-readable storage medium storing a computer program, where the computer program is executed by a processor to implement some or all of the steps of any text regularization method described in the foregoing method embodiments.
Optionally, the storage medium involved in this application, such as the computer-readable storage medium, may be non-volatile or volatile.
An embodiment of this application further provides a computer program product including a non-transitory computer-readable storage medium that stores a computer program, where the computer program is operable to cause a computer to perform some or all of the steps of any text regularization method described in the foregoing method embodiments.
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions. However, a person skilled in the art should understand that this application is not limited by the described order of actions, because according to this application some steps may be performed in another order or simultaneously. A person skilled in the art should also understand that the embodiments described in this specification are all optional embodiments, and the actions and modules involved are not necessarily required by this application.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
A person of ordinary skill in the art can understand that all or some of the steps in the methods of the foregoing embodiments may be completed by a program instructing relevant hardware. The program may be stored in a computer-readable memory, and the memory may include a flash drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
The embodiments of this application have been described in detail above, and specific examples have been used herein to explain the principles and implementations of this application. The description of the foregoing embodiments is intended only to help understand the method of this application and its core idea. A person of ordinary skill in the art may, based on the idea of this application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as a limitation on this application.

Claims (20)

  1. A text regularization method, comprising:
    obtaining a text to be regularized;
    performing character segmentation on the text to be regularized to obtain a plurality of characters;
    encoding each of the plurality of characters to obtain a first feature vector of each of the plurality of characters, wherein the first feature vector of each of the plurality of characters represents context information of that character; and
    regularizing the text to be regularized according to the first feature vector of each of the plurality of characters and a language type of the text to be regularized, to obtain a regularized text of the text to be regularized.
  2. The method according to claim 1, wherein the encoding each of the plurality of characters to obtain a first feature vector of each of the plurality of characters comprises:
    encoding each of the plurality of characters to obtain a character vector corresponding to each of the plurality of characters;
    constructing, centered on a character A, a first text corresponding to the character A, the first text comprising X characters preceding the character A in the text to be regularized, the character A, and Y characters following the character A in the text to be regularized, wherein the character A is any one of the plurality of characters, and X and Y are both integers greater than or equal to 1; and
    concatenating the character vectors corresponding to the characters in the first text to obtain the first feature vector of the character A, wherein the first feature vector of the character A represents context information of the character A in the first text.
  3. The method according to claim 1 or 2, wherein the regularizing the text to be regularized according to the first feature vector of each of the plurality of characters and the language type of the text to be regularized, to obtain a regularized text of the text to be regularized, comprises:
    determining an attribute of each of the plurality of characters according to the first feature vector of that character, wherein the attribute of each of the plurality of characters is standard character or non-standard character;
    using each standard character in the text to be regularized as the regularized character of that standard character;
    encoding and decoding the non-standard characters according to the language type and the first feature vectors corresponding to the non-standard characters in the text to be regularized, to obtain regularized characters of the non-standard characters; and
    combining the regularized characters of the standard characters and the regularized characters of the non-standard characters in the text to be regularized, to obtain the regularized text of the text to be regularized.
  4. The method according to claim 3, wherein the encoding and decoding the non-standard characters according to the language type and the first feature vectors corresponding to the non-standard characters in the text to be regularized, to obtain regularized characters of the non-standard characters, comprises:
    constructing, centered on a character B, a second text corresponding to the character B, the second text comprising M characters preceding the character B in the text to be regularized, the character B, and N characters following the character B in the text to be regularized, wherein the character B is any non-standard character in the text to be regularized, and M and N are both integers greater than or equal to 1;
    encoding each character in the second text by double-byte encoding to obtain a second feature vector of each character in the second text;
    inputting the second feature vector of each character in the second text into a Transformer-XL network to obtain a third feature vector corresponding to the character B, wherein the third feature vector of the character B represents context information of the character B in the second text; and
    encoding and decoding the character B according to the attribute of the character B, the third feature vector of the character B, and the language type, to obtain the regularized character corresponding to the character B.
  5. The method according to claim 4, wherein the encoding and decoding the character B according to the attribute of the character B, the third feature vector of the character B, and the language type, to obtain the regularized character corresponding to the character B, comprises:
    performing word embedding on the character B to obtain a fourth feature vector of the character B;
    encoding the attribute of the character B to obtain a part-of-speech vector corresponding to the character B;
    encoding the language type to obtain a language vector, and using the language vector as both an encoding parameter of an encoder and a decoding parameter of a decoder;
    inputting the fourth feature vector of the character B into the encoder to encode the character B and obtain a fifth feature vector of the character B; and
    inputting the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder to decode the character B and obtain the regularized text corresponding to the character B.
  6. The method according to claim 5, wherein the inputting the fourth feature vector of the character B into the encoder for encoding to obtain the fifth feature vector of the character B comprises:
    encoding the character B according to the hidden-layer vector output by the encoder in its previous encoding step, the fourth feature vector of the character B, and the encoding parameter of the encoder, to obtain the fifth feature vector of the character B.
  7. The method according to claim 5 or 6, wherein the inputting the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder to decode the character B and obtain the regularized text corresponding to the character B comprises:
    performing an attention operation on the hidden-layer vector output by the decoder in its previous decoding step and the fifth feature vector corresponding to the character B, to obtain a sixth feature vector corresponding to the character B;
    concatenating the part-of-speech vector of the character B, the third feature vector of the character B, the sixth feature vector of the character B, and the decoding result of the decoder's previous decoding step, to obtain a target feature vector of the character B; and
    decoding the character B according to the encoding parameter of the encoder and the target feature vector of the character B, to obtain the regularized character corresponding to the character B.
  8. A text regularization apparatus, comprising:
    an acquisition unit, configured to obtain a text to be regularized; and
    a processing unit, configured to: perform character segmentation on the text to be regularized to obtain a plurality of characters; encode each of the plurality of characters to obtain a first feature vector of each of the plurality of characters, wherein the first feature vector of each of the plurality of characters represents context information of that character; and regularize the text to be regularized according to the first feature vector of each of the plurality of characters and a language type of the text to be regularized, to obtain a regularized text of the text to be regularized.
  9. An electronic device, comprising a processor and a memory, wherein the processor is connected to the memory, the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the electronic device performs the following method:
    obtaining a text to be regularized;
    performing character segmentation on the text to be regularized to obtain a plurality of characters;
    encoding each of the plurality of characters to obtain a first feature vector of each of the plurality of characters, wherein the first feature vector of each of the plurality of characters represents context information of that character; and
    regularizing the text to be regularized according to the first feature vector of each of the plurality of characters and a language type of the text to be regularized, to obtain a regularized text of the text to be regularized.
  10. The electronic device according to claim 9, wherein performing the encoding each of the plurality of characters to obtain a first feature vector of each of the plurality of characters comprises:
    encoding each of the plurality of characters to obtain a character vector corresponding to each of the plurality of characters;
    constructing, centered on a character A, a first text corresponding to the character A, the first text comprising X characters preceding the character A in the text to be regularized, the character A, and Y characters following the character A in the text to be regularized, wherein the character A is any one of the plurality of characters, and X and Y are both integers greater than or equal to 1; and
    concatenating the character vectors corresponding to the characters in the first text to obtain the first feature vector of the character A, wherein the first feature vector of the character A represents context information of the character A in the first text.
  11. The electronic device according to claim 9 or 10, wherein performing the regularizing the text to be regularized according to the first feature vector of each of the plurality of characters and the language type of the text to be regularized, to obtain a regularized text of the text to be regularized, comprises:
    determining an attribute of each of the plurality of characters according to the first feature vector of that character, wherein the attribute of each of the plurality of characters is standard character or non-standard character;
    using each standard character in the text to be regularized as the regularized character of that standard character;
    encoding and decoding the non-standard characters according to the language type and the first feature vectors corresponding to the non-standard characters in the text to be regularized, to obtain regularized characters of the non-standard characters; and
    combining the regularized characters of the standard characters and the regularized characters of the non-standard characters in the text to be regularized, to obtain the regularized text of the text to be regularized.
  12. The electronic device according to claim 11, wherein performing the encoding and decoding processing on the non-standard characters according to the language type and the first feature vector corresponding to the non-standard characters in the text to be regularized, to obtain the regularized characters of the non-standard characters, comprises:
    constructing, centered on a character B, a second text corresponding to the character B, wherein the second text comprises the M characters preceding the character B in the text to be regularized, the character B, and the N characters following the character B in the text to be regularized, the character B is any non-standard character in the text to be regularized, and M and N are both integers greater than or equal to 1;
    encoding each character in the second text by double-byte encoding to obtain a second feature vector of each character in the second text;
    inputting the second feature vector of each character in the second text into a Transformer-XL network to obtain a third feature vector corresponding to the character B, wherein the third feature vector of the character B represents context information of the character B in the second text;
    performing encoding and decoding processing on the character B according to the attribute of the character B, the third feature vector of the character B, and the language type, to obtain the regularized character corresponding to the character B.
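As an illustration of the first step only, the second text around a non-standard character B might be built as below. The helper name `build_second_text` is hypothetical, and truncating the window at the ends of the text is an assumption; the subsequent double-byte encoding and Transformer-XL steps are not modeled here.

```python
from typing import List, Tuple


def build_second_text(
    chars: List[str], b_index: int, m: int, n: int
) -> Tuple[List[str], int]:
    """Return the window of M characters before B, B itself, and N characters
    after B, together with the position of B inside that window."""
    if m < 1 or n < 1:
        raise ValueError("M and N must both be at least 1")
    start = max(0, b_index - m)  # assumption: truncate at the start of the text
    end = min(len(chars), b_index + n + 1)  # assumption: truncate at the end
    return chars[start:end], b_index - start


chars = list("call 911 now")
window, offset = build_second_text(chars, b_index=5, m=2, n=2)
```

Returning the offset of B alongside the window lets a later stage pick out the output vector that belongs to B after the whole window has been encoded.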
  13. The electronic device according to claim 12, wherein performing the encoding and decoding processing on the character B according to the attribute of the character B, the third feature vector of the character B, and the language type, to obtain the regularized character corresponding to the character B, comprises:
    performing word embedding processing on the character B to obtain a fourth feature vector of the character B;
    encoding the attribute of the character B to obtain a part-of-speech vector corresponding to the character B;
    encoding the language type to obtain a language vector, and using the language vector as both an encoding parameter of an encoder and a decoding parameter of a decoder;
    inputting the fourth feature vector of the character B into the encoder to encode the character B, to obtain a fifth feature vector of the character B;
    inputting the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder to decode the character B, to obtain the regularized text corresponding to the character B.
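One simple way to realize the two "encoding" steps for the attribute and the language type is a one-hot vector, sketched below. One-hot encoding is an assumption (the claim only says the values are encoded), and the vocabularies `LANGUAGES` and `ATTRIBUTES`, the helper `one_hot`, and the two configuration dictionaries are hypothetical names introduced for illustration.

```python
from typing import List, Sequence


def one_hot(label: str, vocabulary: Sequence[str]) -> List[float]:
    """Encode a categorical label as a one-hot vector over a fixed vocabulary."""
    if label not in vocabulary:
        raise ValueError(f"unknown label: {label!r}")
    return [1.0 if item == label else 0.0 for item in vocabulary]


LANGUAGES = ("zh", "en")  # hypothetical set of supported language types
ATTRIBUTES = ("standard", "non-standard")  # attribute values named in claim 11

language_vector = one_hot("en", LANGUAGES)
pos_vector = one_hot("non-standard", ATTRIBUTES)

# The same language vector conditions both halves of the model.
encoder_config = {"language": language_vector}
decoder_config = {"language": language_vector}
```

Sharing one language vector between encoder and decoder is what allows a single model to serve several languages: switching the language changes the conditioning input rather than the network.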
  14. The electronic device according to claim 13, wherein inputting the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder to decode the character B, to obtain the regularized text corresponding to the character B, comprises:
    performing an attention operation on the hidden-layer vector output by the decoder in its previous decoding step and the fifth feature vector corresponding to the character B, to obtain a sixth feature vector corresponding to the character B;
    concatenating the part-of-speech vector of the character B, the third feature vector of the character B, the sixth feature vector of the character B, and the decoding result of the decoder's previous decoding step, to obtain a target feature vector of the character B;
    decoding the character B according to the encoding parameter of the encoder and the target feature vector of the character B, to obtain the regularized character corresponding to the character B.
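The attention and concatenation steps of one decoding step can be sketched in plain Python as follows. Two assumptions are made that the claim does not state: the fifth feature vector is treated as a sequence of encoder state vectors, and the attention score is a plain dot product followed by a softmax. The function names and the toy values are hypothetical.

```python
import math
from typing import List, Sequence


def attention(
    hidden: Sequence[float], encoder_states: Sequence[Sequence[float]]
) -> List[float]:
    """Dot-product attention: weight each encoder state by its similarity
    to the previous decoder hidden vector and return the weighted sum."""
    scores = [sum(h * e for h, e in zip(hidden, state)) for state in encoder_states]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [value / total for value in exps]
    dim = len(encoder_states[0])
    return [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]


def target_feature_vector(
    pos_vector: Sequence[float],
    third: Sequence[float],
    sixth: Sequence[float],
    previous_output: Sequence[float],
) -> List[float]:
    """Concatenate the four inputs in the order listed in the claim."""
    return [*pos_vector, *third, *sixth, *previous_output]


# Toy values; a zero hidden vector gives every encoder state the same weight.
sixth = attention(hidden=[0.0, 0.0], encoder_states=[[1.0, 0.0], [0.0, 1.0]])
target = target_feature_vector([0.0, 1.0], [0.2, 0.4], sixth, [1.0])
```

The resulting target vector is what a decoder cell would consume at this step; its length is simply the sum of the lengths of the four concatenated parts.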
  15. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the following method:
    obtaining a text to be regularized;
    performing character segmentation on the text to be regularized to obtain a plurality of characters;
    encoding each of the plurality of characters to obtain a first feature vector of each of the plurality of characters, wherein the first feature vector of each character represents context information of that character;
    performing regularization processing on the text to be regularized according to the first feature vector of each of the plurality of characters and the language type of the text to be regularized, to obtain the regularized text of the text to be regularized.
  16. The computer-readable storage medium according to claim 15, wherein encoding each of the plurality of characters to obtain the first feature vector of each of the plurality of characters comprises:
    encoding each of the plurality of characters to obtain a character vector corresponding to each of the plurality of characters;
    constructing, centered on a character A, a first text corresponding to the character A, wherein the first text comprises the X characters preceding the character A in the text to be regularized, the character A, and the Y characters following the character A in the text to be regularized, the character A is any one of the plurality of characters, and X and Y are both integers greater than or equal to 1;
    concatenating the character vectors corresponding to each character in the first text to obtain the first feature vector of the character A, wherein the first feature vector of the character A represents context information of the character A in the first text.
  17. The computer-readable storage medium according to claim 15 or 16, wherein performing the regularization processing on the text to be regularized according to the first feature vector of each of the plurality of characters and the language type of the text to be regularized, to obtain the regularized text of the text to be regularized, comprises:
    determining an attribute of each of the plurality of characters according to the first feature vector of that character, wherein the attribute of each character is either standard character or non-standard character;
    taking each standard character in the text to be regularized as the regularized character of that standard character;
    performing encoding and decoding processing on each non-standard character according to the language type and the first feature vector corresponding to that non-standard character in the text to be regularized, to obtain the regularized character of the non-standard character;
    combining the regularized characters of the standard characters with the regularized characters of the non-standard characters in the text to be regularized, to obtain the regularized text of the text to be regularized.
  18. The computer-readable storage medium according to claim 17, wherein performing the encoding and decoding processing on the non-standard characters according to the language type and the first feature vector corresponding to the non-standard characters in the text to be regularized, to obtain the regularized characters of the non-standard characters, comprises:
    constructing, centered on a character B, a second text corresponding to the character B, wherein the second text comprises the M characters preceding the character B in the text to be regularized, the character B, and the N characters following the character B in the text to be regularized, the character B is any non-standard character in the text to be regularized, and M and N are both integers greater than or equal to 1;
    encoding each character in the second text by double-byte encoding to obtain a second feature vector of each character in the second text;
    inputting the second feature vector of each character in the second text into a Transformer-XL network to obtain a third feature vector corresponding to the character B, wherein the third feature vector of the character B represents context information of the character B in the second text;
    performing encoding and decoding processing on the character B according to the attribute of the character B, the third feature vector of the character B, and the language type, to obtain the regularized character corresponding to the character B.
  19. The computer-readable storage medium according to claim 18, wherein performing the encoding and decoding processing on the character B according to the attribute of the character B, the third feature vector of the character B, and the language type, to obtain the regularized character corresponding to the character B, comprises:
    performing word embedding processing on the character B to obtain a fourth feature vector of the character B;
    encoding the attribute of the character B to obtain a part-of-speech vector corresponding to the character B;
    encoding the language type to obtain a language vector, and using the language vector as both an encoding parameter of an encoder and a decoding parameter of a decoder;
    inputting the fourth feature vector of the character B into the encoder to encode the character B, to obtain a fifth feature vector of the character B;
    inputting the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder to decode the character B, to obtain the regularized text corresponding to the character B.
  20. The computer-readable storage medium according to claim 19, wherein inputting the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder to decode the character B, to obtain the regularized text corresponding to the character B, comprises:
    performing an attention operation on the hidden-layer vector output by the decoder in its previous decoding step and the fifth feature vector corresponding to the character B, to obtain a sixth feature vector corresponding to the character B;
    concatenating the part-of-speech vector of the character B, the third feature vector of the character B, the sixth feature vector of the character B, and the decoding result of the decoder's previous decoding step, to obtain a target feature vector of the character B;
    decoding the character B according to the encoding parameter of the encoder and the target feature vector of the character B, to obtain the regularized character corresponding to the character B.
PCT/CN2021/083493 2020-12-31 2021-03-29 Text regularization method and apparatus, and electronic device and storage medium WO2022141855A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011644545.8A CN112765937A (en) 2020-12-31 2020-12-31 Text regularization method and device, electronic equipment and storage medium
CN202011644545.8 2020-12-31

Publications (1)

Publication Number Publication Date
WO2022141855A1 (en) 2022-07-07

Family

ID=75698776

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083493 WO2022141855A1 (en) 2020-12-31 2021-03-29 Text regularization method and apparatus, and electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN112765937A (en)
WO (1) WO2022141855A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114662499A (en) * 2022-03-17 2022-06-24 平安科技(深圳)有限公司 Text-based emotion recognition method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090281791A1 (en) * 2008-05-09 2009-11-12 Microsoft Corporation Unified tagging of tokens for text normalization
US20170116177A1 (en) * 2015-10-26 2017-04-27 24/7 Customer, Inc. Method and apparatus for facilitating customer intent prediction
CN107680579A (en) * 2017-09-29 2018-02-09 百度在线网络技术(北京)有限公司 Text regularization model training method and device, text regularization method and device
CN110765733A (en) * 2019-10-24 2020-02-07 科大讯飞股份有限公司 Text normalization method, device, equipment and storage medium
CN111832248A (en) * 2020-07-27 2020-10-27 科大讯飞股份有限公司 Text normalization method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112765937A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
CN111444340B (en) Text classification method, device, equipment and storage medium
US10650102B2 (en) Method and apparatus for generating parallel text in same language
CN113807098B (en) Model training method and device, electronic equipment and storage medium
JP7301922B2 (en) Semantic retrieval method, device, electronic device, storage medium and computer program
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
WO2023138188A1 (en) Feature fusion model training method and apparatus, sample retrieval method and apparatus, and computer device
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
CN113076739A (en) Method and system for realizing cross-domain Chinese text error correction
WO2023040493A1 (en) Event detection
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
US20230114673A1 (en) Method for recognizing token, electronic device and storage medium
JP2023062150A (en) Character recognition model training, character recognition method, apparatus, equipment, and medium
CN114398943B (en) Sample enhancement method and device thereof
CN113255331B (en) Text error correction method, device and storage medium
CN110969005B (en) Method and device for determining similarity between entity corpora
CN113743101A (en) Text error correction method and device, electronic equipment and computer storage medium
WO2022141855A1 (en) Text regularization method and apparatus, and electronic device and storage medium
CN113076744A (en) Cultural relic knowledge relation extraction method based on convolutional neural network
CN116402166B (en) Training method and device of prediction model, electronic equipment and storage medium
CN115357710B (en) Training method and device for table description text generation model and electronic equipment
CN112765330A (en) Text data processing method and device, electronic equipment and storage medium
CN110516125A (en) Method, apparatus, device and readable storage medium for identifying abnormal character strings
WO2022073341A1 (en) Disease entity matching method and apparatus based on voice semantics, and computer device
CN112966501B (en) New word discovery method, system, terminal and medium
WO2021082570A1 (en) Artificial intelligence-based semantic identification method, device, and semantic identification apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21912625; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21912625; Country of ref document: EP; Kind code of ref document: A1)