WO2022141855A1 - Text regularization method and apparatus, electronic device and storage medium - Google Patents

Text regularization method and apparatus, electronic device and storage medium

Info

Publication number
WO2022141855A1
Authority
WO
WIPO (PCT)
Prior art keywords
character
text
characters
feature vector
regularized
Prior art date
Application number
PCT/CN2021/083493
Other languages
English (en)
Chinese (zh)
Inventor
李俊杰
蒋伟伟
马骏
王少军
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022141855A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/12 - Use of codes for handling textual entities
    • G06F 40/126 - Character encoding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G06F 16/353 - Clustering; Classification into predefined classes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/903 - Querying
    • G06F 16/90335 - Query processing
    • G06F 16/90344 - Query processing by using string matching techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a text regularization method, apparatus, electronic device and storage medium.
  • the embodiments of the present application provide a text regularization method, apparatus, electronic device and storage medium, which can perform text regularization according to the language type of the text to be regularized and the feature vector of each character, so as to improve the text regularization efficiency and reduce labor costs.
  • an embodiment of the present application provides a text regularization method, including: obtaining the text to be regularized; performing character segmentation on the text to be regularized to obtain a plurality of characters; encoding each character in the plurality of characters to obtain a first feature vector of each character, wherein the first feature vector of each character is used to represent the context information of that character in the plurality of characters; and, according to the first feature vector of each character and the language type of the text to be regularized, performing regular processing on the text to be regularized to obtain the regular text of the text to be regularized.
  • an embodiment of the present application provides a text regularization device, including: an acquisition unit, configured to acquire the text to be regularized; and a processing unit, configured to perform character segmentation on the text to be regularized to obtain a plurality of characters; encode each character in the plurality of characters to obtain the first feature vector of each character, wherein the first feature vector of each character is used to represent the context information of that character in the plurality of characters; and, according to the first feature vector of each character and the language type of the text to be regularized, perform regular processing on the text to be regularized to obtain the regular text of the text to be regularized.
  • an embodiment of the present application provides an electronic device, including: a processor connected to a memory, where the memory is used for storing a computer program and the processor is used for executing the computer program stored in the memory, so that the electronic device executes a text regularization method.
  • the text regularization method includes: obtaining the text to be regularized; performing character segmentation on the text to be regularized to obtain a plurality of characters; and encoding each character in the plurality of characters to obtain the first feature vector of each character, wherein the first feature vector of each character is used to represent the context information of that character in the plurality of characters;
  • according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized, the text to be regularized is subjected to regular processing to obtain the regular text of the text to be regularized.
  • an embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program causes a computer to execute a text regularization method, and the text regularization method includes: obtaining the text to be regularized; performing character segmentation on the text to be regularized to obtain a plurality of characters; encoding each character in the plurality of characters to obtain the first feature vector of each character, wherein the first feature vector of each character is used to represent the context information of that character in the plurality of characters; and, according to the first feature vector of each character and the language type of the text to be regularized, performing regular processing on the text to be regularized to obtain the regular text of the text to be regularized.
  • FIG. 1 is a schematic flowchart of a text regularization method provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of encoding and decoding processing of non-standard characters according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of encoding and decoding of non-standard characters by an encoder and a decoder according to an embodiment of the present application.
  • FIG. 4 is a block diagram of functional units of a text regularization apparatus provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a text regularization apparatus provided by an embodiment of the present application.
  • the technical solution of the present application relates to the field of artificial intelligence technology, realizing text regularization and helping to promote the construction of smart cities.
  • the data involved in this application, such as the text to be regularized and/or the regular text, may be stored in a database or in a blockchain, which is not limited in this application.
  • FIG. 1 is a schematic flowchart of a text regularization method provided by an embodiment of the present application. The method is applied to a text regularization device. The method includes the following steps.
  • the text regularization device obtains the text to be regularized.
  • the text to be regularized may be manually input by the user in the information input field of the text regularization device, or it may be automatically read by the text regularization device from a text library.
  • for example, if there is a document to be regularized, the text regularization device may sequentially read the text to be regularized from the document. This application does not limit how the text to be regularized is acquired.
  • the text regularization device performs character segmentation on the text to be regularized to obtain multiple characters.
  • the text to be regularized may be segmented into characters by a tokenizer, for example, by a word2vec tokenizer.
  • the characters may be English words, Chinese words, French words or special symbols, such as "$", "/", and so on.
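As a minimal illustration of this segmentation step (a regex-based stand-in chosen for illustration, not the word2vec tokenizer mentioned above), a sentence can be split into word tokens and standalone special symbols:

```python
import re

def segment(text):
    """Minimal character/word segmentation sketch: splits a sentence into
    word tokens and standalone special symbols such as "$" or "/".
    Illustrative stand-in only, not the patent's tokenizer."""
    # \w+ grabs words/numbers; any other non-space symbol becomes its own token
    return re.findall(r"\w+|[^\w\s]", text)

tokens = segment("Achieve record net income of about $1 billion during the year")
# tokens include 'Achieve', ..., '$', '1', 'billion', ...
```

Note that "$1" is split into the two characters "$" and "1", matching the treatment of non-standard characters later in this document.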
  • the text regularization device encodes each of the multiple characters to obtain a first feature vector of each of the multiple characters, wherein the first feature vector of each character is used to represent the context information of that character among the multiple characters.
  • each character in the plurality of characters is encoded to obtain a character vector corresponding to each character in the plurality of characters.
  • specifically, each character is first segmented to obtain the letter string of each character; each letter in the letter string of each character is then encoded to obtain the letter vector corresponding to each letter; finally, the letter vectors of each character are encoded to obtain the character vector of each character.
  • for example, the character "Achieve" is processed into the letter string "A", "c", "h", "i", "e", "v", "e", and the letter vector of each letter in the letter string is used as the input of the encoder, resulting in the character vector of the character "Achieve".
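The letter-by-letter encoding above can be sketched as follows. The bit-pattern letter embedding and the simple recurrent fold are toy stand-ins chosen for illustration, not the patent's trained encoder:

```python
import numpy as np

def letter_vectors(char):
    """Map each letter to a toy 8-dim binary vector from its code point.
    Placeholder for a learned letter embedding (an assumption)."""
    dim = 8
    return np.stack([
        np.array([(ord(c) >> k) & 1 for k in range(dim)], dtype=float)
        for c in char
    ])

def encode_character(char, W=None):
    """Fold the letter vectors of a character into one character vector
    with a simple recurrent update h = tanh(W @ (h + x)); a stand-in for
    the encoder described above."""
    dim = 8
    W = np.eye(dim) if W is None else W
    h = np.zeros(dim)
    for x in letter_vectors(char):   # letters fed in order: "A", "c", "h", ...
        h = np.tanh(W @ (h + x))
    return h

vec = encode_character("Achieve")    # one fixed-size vector per character
```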
  • specifically, a first text corresponding to the character A is constructed with the character A as the center, wherein the character A is any one of the multiple characters, and the first text includes the X characters located before the character A in the text to be regularized, the character A, and the Y characters located after the character A in the text to be regularized, where X and Y are both integers greater than or equal to 1. The character vectors of the characters in the first text are then spliced (i.e., horizontally concatenated) to obtain the first feature vector corresponding to the character A, which is used to represent the context information of the character A in the first text.
  • if the character A is the first character or the last character in the text to be regularized, a preset character may be filled in (for example, the start character S), and the first text for the character A is constructed in this padded form.
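The construction of the first text and the horizontal splicing can be sketched as below; the window sizes X and Y and the zero padding vector (standing in for the start character S) are illustrative assumptions:

```python
import numpy as np

def first_text_vector(char_vecs, i, X=2, Y=2, pad=None):
    """Build the 'first text' around character i (X characters before,
    Y after), padding with a preset vector at the text boundaries, then
    horizontally splice the character vectors into one first feature
    vector. X, Y and the zero pad are illustrative choices."""
    dim = char_vecs.shape[1]
    pad = np.zeros(dim) if pad is None else pad   # stands in for the start character S
    window = []
    for j in range(i - X, i + Y + 1):
        window.append(char_vecs[j] if 0 <= j < len(char_vecs) else pad)
    return np.concatenate(window)                 # horizontal splicing

vecs = np.arange(12, dtype=float).reshape(6, 2)   # 6 toy character vectors, dim 2
fv = first_text_vector(vecs, i=0, X=2, Y=2)       # first character: 2 pads precede it
```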
  • the text regularization device performs regular processing on the text to be regularized according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized, to obtain the regular text of the text to be regularized.
  • specifically, the attribute of each character in the plurality of characters is first determined, wherein the attribute of each character is either a standard character or a non-standard character. Then, each standard character in the text to be regularized is used as its own regular character, that is, the standard character itself is the regular character of the standard character; and according to the language type of the text to be regularized and the first feature vector corresponding to each non-standard character in the text to be regularized, the non-standard character is encoded and decoded to obtain the regular character of the non-standard character. Finally, the regular characters of the standard characters and the regular characters of the non-standard characters are combined to obtain the regular text of the text to be regularized.
  • a standard character refers to a character whose pronunciation and written form are the same.
  • for example, for the character "年" ("year"), its pronunciation and writing are the same, that is, both are "year", so the regular character of this standard character is itself.
  • the non-standard characters involved in this application include but are not limited to the following:
  • specifically, a second text corresponding to the character B is constructed with the character B as the center, wherein the second text includes the M characters located before the character B in the text to be regularized, the character B, and the N characters located after the character B in the text to be regularized, where the character B is any non-standard character in the text to be regularized and M and N are integers greater than or equal to 1. Then, each character in the second text is encoded through byte-pair encoding (BPE) to obtain the second feature vector of each character in the second text.
  • specifically, each character in the second text is split into letter strings, and according to the frequency of occurrence of the letter strings of all characters in the second text, the letter strings of each character are merged to obtain new letter strings; then, the new letter strings of each character are input into the encoder for encoding, obtaining the second feature vector of each character in the second text.
  • the problem of out-of-vocabulary words in the second text can be solved by byte-pair encoding.
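A tiny sketch of the byte-pair-encoding merge procedure described above (real BPE vocabularies are learned on large corpora; the toy word list here is only for illustration):

```python
from collections import Counter

def bpe_merges(words, num_merges=3):
    """Tiny byte-pair-encoding sketch: repeatedly merge the most frequent
    adjacent symbol pair across all words. Frequent subword units emerge,
    which is how BPE handles out-of-vocabulary words."""
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            for a, b in zip(s, s[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        for s in seqs:                        # apply the merge in every word
            i = 0
            while i < len(s) - 1:
                if s[i] == a and s[i + 1] == b:
                    s[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, seqs

merges, seqs = bpe_merges(["lower", "lowest", "low"])
```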
  • next, the second feature vector of each character in the second text is input into the Transformer-XL network for feature extraction to obtain the third feature vector corresponding to the character B, where the third feature vector of the character B is used to represent the context information of the character B in the second text. Finally, according to the first feature vector of the character B, the third feature vector of the character B and the language type of the text to be regularized, the character B is encoded and decoded to obtain the regular character corresponding to the character B.
  • the process of encoding and decoding the character B will be described in detail later and will not be repeated here.
  • in summary, character segmentation is first performed on the text to be regularized, and each character is then encoded to obtain the first feature vector of each character; finally, regular processing is performed on the text to be regularized according to the first feature vector of each character and the language type of the text to be regularized. That is, text regularization can be completed without manually writing regular rules, which improves the efficiency of text regularization and saves labor cost.
  • in addition, because the language type of the text to be regularized is taken into account, texts in various languages can be regularized, so the text regularization method of the present application covers more usage scenarios.
  • FIG. 2 is a schematic flowchart of an encoding and decoding method provided by an embodiment of the present application. The method is applied to a text regularization device. The method includes the following steps.
  • performing word embedding processing on the character B actually means performing mapping processing on the character B to obtain the fourth feature vector of the character B.
  • the ASCII code of character B can be used as the fourth feature vector of character B.
  • encoding the attribute of the character B means mapping the part of speech to which the character B belongs to obtain the part-of-speech vector of the character B. For example, if the part of speech of the character B is "currency", the GB2312 code of "currency" is used as the part-of-speech vector of the character B.
  • it should be noted that the process of classifying the attribute of the character B only classifies whether the character is a standard character or a non-standard character; no subdivision is performed among non-standard characters. Therefore, the first feature vector of each character can only be used to distinguish whether the character is a standard character or a non-standard character, and cannot further distinguish among non-standard characters.
  • to classify the character B in more detail, the part-of-speech vector of each non-standard character is mapped to obtain a finer-grained category of each non-standard character.
  • the language type of the text to be regularized is mapped to obtain the language vector of the text to be regularized.
  • for example, the GB2312 code of the Chinese representation of the language type may be used as the language vector, where the language type is, for example, "English", "Chinese", "French", etc.
  • the encoder may be a neural network constructed based on a long short-term memory network, a bidirectional long short-term memory network or a recurrent neural network; this application does not limit the type of the encoder.
  • specifically, the character B is encoded according to the hidden layer vector output by the encoder in the last encoding, the fourth feature vector of the character B, and the encoding parameter of the encoder (that is, the language vector), to obtain the fifth feature vector corresponding to the character B and the hidden layer vector corresponding to the character B.
  • initially, the hidden layer vector output by the encoder in the last encoding is a preset hidden layer vector, for example, a zero vector.
  • the hidden layer vector finally output by the encoder is the hidden layer vector generated by the encoding process of the character B. If other non-standard characters need to be encoded, the hidden layer vector corresponding to the character B is used as the last hidden layer vector for the next non-standard character to be encoded.
  • if the character B is one of multiple consecutive non-standard characters with the same attributes in the text to be regularized (that is, their parts of speech are exactly the same, for example, they are all non-standard date words), then in order to improve the encoding efficiency and encoding precision for non-standard characters, these multiple non-standard characters can be encoded together rather than encoding a single non-standard character in isolation.
  • the hidden layer vector output by the last encoding contains the contextual semantic information of these multiple non-standard characters.
  • that is, the multiple non-standard characters [X1, X2, ..., Xn] are encoded successively, and the fifth feature vectors [Y1, Y2, ..., Yn] corresponding to the multiple non-standard characters are output.
  • for example, for the three consecutive non-standard characters in "$1 billion", the three non-standard characters can be encoded continuously, and the fifth feature vector of each of the three non-standard characters and the hidden layer vector obtained by the encoder in the last encoding are output.
  • the hidden layer vector contains the full-text semantic information of these three non-standard characters.
  • the decoder may be a neural network constructed based on a long short-term memory network, a bidirectional long short-term memory network or a recurrent neural network; this application does not limit the type of the decoder.
  • an attention mechanism operation is performed on the hidden layer vector output by the decoder in the last decoding and the fifth feature vector corresponding to the character B, to obtain the sixth feature vector corresponding to the character B.
  • the attention mechanism may be a general attention mechanism operation. For example, the fifth feature vector corresponding to the character B may be used as the key-value pair, that is, the key vector and the value vector; then the hidden layer vector output by the decoder in the last decoding is used as the query vector to perform the attention mechanism operation, obtaining the sixth feature vector corresponding to the character B.
  • the attention mechanism operations involved later are similar and will not be described again.
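The general attention mechanism operation described above can be sketched as follows, with the decoder's previous hidden vector as the query and the fifth feature vectors as both keys and values; the random vectors are illustrative stand-ins:

```python
import numpy as np

def attention(query, keys, values):
    """General attention sketch: score each key against the query with a
    dot product, softmax the scores, and return the weighted sum of the
    values. Here the query is the decoder's previous hidden layer vector
    and keys == values are the fifth feature vectors."""
    scores = keys @ query                      # dot-product scoring
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over characters
    return weights @ values                    # the sixth feature vector

Y = np.random.default_rng(0).normal(size=(3, 4))  # fifth vectors of 3 characters (toy)
h = np.zeros(4)                                    # previous hidden vector (toy)
sixth = attention(h, Y, Y)
```

With a zero query, the softmax weights are uniform and the output is simply the mean of the fifth vectors; a trained hidden vector would instead focus the weights on the character to be decoded next.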
  • if the character B is the first character to be decoded, the hidden layer vector output by the decoder in the last decoding is the hidden layer vector output by the encoder in the last encoding; if the character B is not the first character to be decoded, the hidden layer vector output by the decoder in the last decoding is the hidden layer vector generated when the decoder decoded the previous character. Since the hidden layer vector output by the decoder in the last decoding (for example, the hidden layer vector output by the encoder in the last encoding) contains the contextual semantic information of the character B, the attention mechanism operation retains the key information for this decoding, thereby improving the decoding accuracy.
  • then, the part-of-speech vector of the character B, the third feature vector of the character B, the sixth feature vector of the character B and the decoding result of the decoder's last decoding are spliced to obtain the target feature vector of the character B. According to the decoding parameter of the decoder (the language vector) and the target feature vector of the character B, the character B is decoded to obtain the regular character corresponding to the character B. That is, the decoding parameters of the decoder are used to operate on the target feature vector to obtain the probability of each character in the standard dictionary, and the standard character with the maximum probability is used as the regular character of the character B.
  • the decoding result of the decoder's last decoding is the decoding result generated when the decoder decoded the previous character (that is, the feature vector of the regular character of the previous character).
  • if the character B is the first character to be decoded, the decoding result of the last decoding is the feature vector of a preset character. For example, if the preset character is the start symbol S, the feature vector of the start symbol S is spliced in to indicate the start of this decoding.
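The final step of a single decoding, projecting the spliced target feature vector onto a standard dictionary and taking the maximum-probability character, can be sketched as below; the three-word dictionary and the random output matrix are hypothetical stand-ins, not trained decoder parameters:

```python
import numpy as np

def decode_step(target_vec, W_out, dictionary):
    """One decoding step sketch: project the spliced target feature
    vector onto the standard dictionary with the decoder's output
    parameters, softmax into probabilities, and emit the
    highest-probability standard character."""
    logits = W_out @ target_vec
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # probability of each dictionary entry
    return dictionary[int(np.argmax(probs))], probs

dictionary = ["one", "billion", "dollars"]     # toy standard dictionary
rng = np.random.default_rng(1)
W_out = rng.normal(size=(3, 6))                # stand-in decoding parameters
char, probs = decode_step(rng.normal(size=6), W_out, dictionary)
```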
  • if the character B is one of multiple consecutive non-standard characters with the same attributes in the text to be regularized (that is, their parts of speech are exactly the same, for example, they are all non-standard date words), then in order to speed up the decoding of non-standard characters, the multiple non-standard characters are decoded sequentially according to the fifth feature vectors of the multiple non-standard characters, and no single non-standard character is decoded in isolation.
  • specifically, the hidden layer vector e0 output by the encoder in the last encoding and the fifth feature vectors [Y1, Y2, ..., Yn] of the multiple non-standard characters [X1, X2, ..., Xn] are used to perform the attention mechanism operation to obtain a sixth feature vector.
  • since the hidden layer vector output by the encoder in the last encoding contains the full-text semantic information of the multiple non-standard characters [X1, X2, ..., Xn], the attention mechanism operation places the decoding attention on the first character that needs to be decoded, thereby improving the decoding accuracy.
  • the part-of-speech vector of the multiple non-standard characters may be the part-of-speech vector of any one of the multiple non-standard characters, and the third feature vector of the multiple non-standard characters is the average of the third feature vectors of each of the multiple non-standard characters.
  • first, the first non-standard character to be decoded is decoded to obtain the first decoding result Z1 (that is, the regular character of the first non-standard character to be decoded) and the hidden layer vector d1 of the first decoding. Then, the hidden layer vector d1, the decoding result Z1, the part-of-speech vector L and the third feature vector H of the multiple non-standard characters, and the fifth feature vectors [Y1, Y2, ..., Yn] of the multiple non-standard characters are used for the second decoding, obtaining the decoding result Z2 of the second decoding (that is, the regular character of the second non-standard character to be decoded) and the hidden layer vector of the second decoding. These steps are repeated until the regular characters [Z1, Z2, ..., Zn] of all the multiple non-standard characters [X1, X2, ..., Xn] are obtained.
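The sequential decoding loop above can be sketched as follows; all weight matrices are random stand-ins rather than trained decoder parameters, and the feedback of the emitted character's vector is simplified for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_sequence(Y, L, H, W_h, W_out, dictionary):
    """Sketch of the loop above: for each of n non-standard characters,
    attend over the fifth vectors Y with the previous hidden vector,
    splice [sixth, L, H, previous output], update the hidden vector,
    and emit one regular character from the dictionary."""
    n, d = Y.shape
    h = np.zeros(d)                       # stands in for the encoder's last hidden vector e0
    prev = np.zeros(d)                    # stands in for the start symbol S's feature vector
    out = []
    for _ in range(n):
        sixth = softmax(Y @ h) @ Y        # attention step over the fifth vectors
        target = np.concatenate([sixth, L, H, prev])
        h = np.tanh(W_h @ target)         # this step's hidden layer vector
        k = int(np.argmax(softmax(W_out @ target)))
        out.append(dictionary[k])
        prev = Y[k % n]                   # toy stand-in for the emitted character's vector
    return out

rng = np.random.default_rng(0)
d, n = 4, 3
Y = rng.normal(size=(n, d))               # fifth vectors [Y1, Y2, Y3]
L = rng.normal(size=d)                    # shared part-of-speech vector
H = Y.mean(axis=0)                        # averaged third feature vector
W_h = rng.normal(size=(d, 4 * d))
W_out = rng.normal(size=(3, 4 * d))
regular = decode_sequence(Y, L, H, W_h, W_out, ["one", "billion", "dollars"])
```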
  • the following takes the non-standard character string "$1 billion" as an example to illustrate the decoding process.
  • in the first decoding, the hidden layer vector output by the encoder in the last encoding and the fifth feature vectors of the above three non-standard characters are used to perform the attention mechanism operation to obtain a sixth feature vector (because the first regular character to be output corresponds to the character "1", the sixth feature vector focuses on the character "1"). Then, the sixth feature vector, the part-of-speech vector (the part-of-speech vectors of the three non-standard characters are the same), the third feature vector (obtained by averaging the third feature vectors of the non-standard characters) and the feature vector of the start symbol S are spliced to obtain a target feature vector. The first decoding is performed according to this target feature vector to obtain the vector of the character "1" (after mapping this vector, the regular character of "1" is obtained as "one") and the hidden layer vector of the first decoding. Next, the second decoding is performed: the attention mechanism operation is performed using the hidden layer vector of the first decoding and the fifth feature vectors of the above three characters to obtain a new sixth feature vector; this sixth feature vector, the part-of-speech vector, the third feature vector and the vector of the character "1" output by the first decoding are spliced to obtain a target feature vector, which is input into the decoder for decoding to obtain the vector of the character "billion" (after mapping, the regular character of "billion" is "billion") and the hidden layer vector of the second decoding. Finally, the third decoding is performed: the attention mechanism operation is performed using the hidden layer vector of the second decoding and the fifth feature vectors of the above three characters to obtain a new sixth feature vector; this sixth feature vector, the part-of-speech vector, the third feature vector and the vector of the character "billion" output by the second decoding are spliced to obtain a target feature vector, which is input into the decoder for decoding to obtain the vector of the character "$" (after mapping, the regular character of "$" is obtained as "dollars").
  • in this way, the three consecutive non-standard characters "$1 billion" can be regularized into "one billion dollars" at one time, and the above text to be regularized is regularized as "Achieve record net income of about one billion dollars during the year".
  • FIG. 4 is a block diagram of functional units of a text regularization device provided by an embodiment of the present application.
  • the text regularization device 400 includes an obtaining unit 401 and a processing unit 402, wherein: the obtaining unit 401 is used to obtain the text to be regularized; the processing unit 402 is used to perform character segmentation on the text to be regularized to obtain a plurality of characters; encode each character in the plurality of characters to obtain the first feature vector of each character, wherein the first feature vector of each character is used to represent the context information of that character in the plurality of characters; and, according to the first feature vector of each character and the language type of the text to be regularized, perform regular processing on the text to be regularized to obtain the regular text of the text to be regularized.
  • in terms of encoding each character in the plurality of characters to obtain the first feature vector of each character, the processing unit 402 is specifically configured to: encode each character in the plurality of characters to obtain the character vector corresponding to each character; construct, with the character A as the center, a first text corresponding to the character A, where the first text includes the X characters located before the character A in the text to be regularized, the character A, and the Y characters located after the character A in the text to be regularized, the character A is any one of the plurality of characters, and X and Y are both integers greater than or equal to 1; and splice the character vectors corresponding to the characters in the first text to obtain the first feature vector of the character A, where the first feature vector of the character A is used to represent the context information of the character A in the first text.
  • in terms of performing regular processing on the text to be regularized to obtain the regular text of the text to be regularized, the processing unit 402 is specifically configured to: determine the attribute of each character in the plurality of characters, where the attribute of each character is either a standard character or a non-standard character; use the standard characters in the text to be regularized as their own regular characters; according to the language type and the first feature vectors corresponding to the non-standard characters in the text to be regularized, encode and decode the non-standard characters to obtain the regular characters of the non-standard characters; and combine the regular characters of the standard characters and the regular characters of the non-standard characters in the text to be regularized to obtain the regular text of the text to be regularized.
  • in terms of encoding and decoding the non-standard characters to obtain the regular characters of the non-standard characters, the processing unit 402 is specifically configured to: construct, with the character B as the center, a second text corresponding to the character B, where the second text includes the M characters located before the character B in the text to be regularized, the character B, and the N characters located after the character B in the text to be regularized, the character B is any non-standard character in the text to be regularized, and M and N are both integers greater than or equal to 1; encode each character in the second text through byte-pair encoding to obtain the second feature vector of each character in the second text; input the second feature vector of each character in the second text into the Transformer-XL network to obtain the third feature vector corresponding to the character B, where the third feature vector of the character B is used to represent the context information of the character B in the second text; and perform encoding and decoding processing on the character B according to the attribute of the character B, the third feature vector of the character B, and the language type, to obtain the regular character corresponding to the character B.
  • the processing unit 402 is specifically configured to: perform word embedding processing on the character B to obtain the fourth feature vector of the character B; encode the attribute of the character B to obtain the part-of-speech vector corresponding to the character B; encode the language type to obtain a language vector, and use the language vector as the encoding parameter of the encoder and the decoding parameter of the decoder, respectively; input the fourth feature vector of the character B into the encoder and encode the character B to obtain the fifth feature vector of the character B; and input the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder and decode the character B to obtain the regular character corresponding to the character B.
  • in terms of inputting the fourth feature vector of the character B into the encoder for encoding and obtaining the fifth feature vector of the character B, the processing unit 402 is specifically configured to: encode the character B according to the hidden layer vector output by the encoder in the last encoding, the fourth feature vector of the character B, and the encoding parameters of the encoder, to obtain the fifth feature vector of the character B.
  • the part-of-speech vector of the character B and the fifth feature vector of the character B are input into the decoder, and the character B is decoded to obtain the regular character corresponding to the character B.
  • the processing unit 402 is specifically configured to: perform an attention mechanism operation on the hidden layer vector decoded and output by the decoder last time and the fifth feature vector corresponding to the character B to obtain the sixth feature vector corresponding to the character B.
  • the part-of-speech vector of the character B, the third feature vector of the character B, the sixth feature vector of the character B, and the decoding result output by the decoder in the previous decoding are spliced to obtain the target feature vector of the character B; the character B is decoded according to the encoding parameters of the encoder and the target feature vector of the character B to obtain the regular character corresponding to the character B.
  • FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • the electronic device includes a processor and memory.
  • the electronic device may further include a transceiver.
  • the electronic device 500 includes a transceiver 501 , a processor 502 and a memory 503 . They are connected by bus 504 .
  • the memory 503 is used to store computer programs and data, and can transmit the data stored in the memory 503 to the processor 502 .
  • the processor 502 is configured to read the computer program in the memory 503 and perform the following operations: control the transceiver 501 to obtain the text to be regularized; perform character segmentation on the text to be regularized to obtain multiple characters; encode each of the multiple characters to obtain a first feature vector of each of the multiple characters, wherein the first feature vector of each character is used to represent the context information of that character; and perform regularization processing on the text to be regularized according to the first feature vector of each of the multiple characters and the language type of the text to be regularized, to obtain the regular text of the text to be regularized.
  • the processor 502 is specifically configured to perform the following operations: encode each of the multiple characters to obtain a character vector corresponding to each of the multiple characters; take character A as the center and construct a first text corresponding to the character A, where the first text includes X characters before the character A in the text to be regularized, the character A, and Y characters after the character A in the text to be regularized, the character A being any one of the multiple characters, and X and Y both being integers greater than or equal to 1; and splice the character vector corresponding to each character in the first text to obtain the first feature vector of the character A, the first feature vector of the character A being used to represent the context information of the character A in the first text.
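The "splicing" (concatenation) of per-character vectors into the first feature vector can be sketched as follows. The code-point-based toy embedding stands in for whatever character encoding the claimed model actually learns, and the window sizes X and Y are illustrative:

```python
def char_vector(ch, dim=4):
    """Toy per-character vector derived from the code point; a
    learned character embedding would replace this."""
    code = ord(ch)
    return [((code >> (4 * i)) & 0xF) / 15.0 for i in range(dim)]

def first_feature_vector(chars, center, x=1, y=1, pad=" "):
    """Concatenate ('splice') the vectors of the X characters before
    character A, character A itself, and the Y characters after it,
    so the result carries A's context in the first text."""
    vec = []
    for i in range(center - x, center + y + 1):
        ch = chars[i] if 0 <= i < len(chars) else pad
        vec.extend(char_vector(ch))
    return vec

v = first_feature_vector(list("a5b"), center=1, x=1, y=1)
print(len(v))  # (1 + 1 + 1) characters × 4 dimensions = 12
```

Because the spliced vector is a fixed-size concatenation, downstream layers receive the center character together with its neighbors in a single input.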
  • the text to be regularized is subjected to regularization processing to obtain the regular text of the text to be regularized.
  • the processor 502 is specifically configured to perform the following operations: determine the attribute of each of the multiple characters, where the attribute of each character is either a standard character or a non-standard character; take the standard characters in the text to be regularized as their own regular characters; encode and decode the non-standard characters according to the language type and the first feature vectors corresponding to the non-standard characters in the text to be regularized, to obtain the regular characters of the non-standard characters; and combine the regular characters of the standard characters and the regular characters of the non-standard characters in the text to be regularized to obtain the regular text of the text to be regularized.
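The overall flow of this claim (classify each character, pass standard characters through unchanged, send non-standard ones through a normalization step, then recombine in order) can be sketched with a rule-based stand-in. The digit/symbol rule and the lookup table below replace the claim's learned encoder-decoder and are purely illustrative:

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def is_standard(ch):
    # Hypothetical attribute rule: digits and '%' are non-standard.
    return re.match(r"[0-9%]", ch) is None

def normalize_nonstandard(ch):
    # Stand-in for the encoder-decoder: a simple lookup.
    if ch.isdigit():
        return DIGIT_WORDS[int(ch)]
    return {"%": " percent"}.get(ch, ch)

def regularize(text):
    out = []
    for ch in text:
        # Standard characters become their own regular characters;
        # non-standard characters are normalized, then all results
        # are combined in order into the regular text.
        out.append(ch if is_standard(ch) else normalize_nonstandard(ch))
    return "".join(out)

print(regularize("rose 5%"))  # → "rose five percent"
```

The point of the split is efficiency: only non-standard characters pay the cost of the encode-decode step, while standard characters are copied through directly.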
  • the non-standard characters are encoded and decoded to obtain the regular characters of the non-standard characters.
  • the processor 502 is specifically configured to perform the following operations: take character B as the center and construct a second text corresponding to the character B, where the second text includes the M characters located before the character B in the text to be regularized, the character B, and the N characters located after the character B in the text to be regularized.
  • the character B is encoded and decoded according to the attribute of the character B, the third feature vector of the character B, and the language type, to obtain the regular character corresponding to the character B.
  • encoding and decoding processing is performed on the character B according to the attribute of the character B, the third feature vector of the character B, and the language type, so as to obtain the regular character corresponding to the character B.
  • the processor 502 is specifically configured to perform the following operations: perform word embedding processing on the character B to obtain a fourth feature vector of the character B; encode the attribute of the character B to obtain the part-of-speech vector corresponding to the character B; encode the language type to obtain a language vector, and use the language vector as the encoding parameter of the encoder and the decoding parameter of the decoder, respectively; input the fourth feature vector of the character B into the encoder to encode the character B and obtain the fifth feature vector of the character B; and input the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder to decode the character B and obtain the regular character corresponding to the character B.
  • in terms of inputting the fourth feature vector of the character B into the encoder for encoding to obtain the fifth feature vector of the character B, the processor 502 is specifically configured to perform the following operations: encode the character B according to the hidden layer vector output by the encoder in the previous encoding, the fourth feature vector of the character B, and the encoding parameters of the encoder, to obtain the fifth feature vector of the character B.
  • the part-of-speech vector of the character B and the fifth feature vector of the character B are input into the decoder, and the character B is decoded to obtain the regular character corresponding to the character B.
  • the processor 502 is specifically configured to perform the following operations: perform an attention mechanism operation on the hidden layer vector output by the decoder in the previous decoding and the fifth feature vector corresponding to the character B to obtain the sixth feature vector corresponding to the character B; splice the part-of-speech vector of the character B, the third feature vector of the character B, the sixth feature vector of the character B, and the decoding result output by the decoder in the previous decoding to obtain the target feature vector of the character B; and decode the character B according to the encoding parameters of the encoder and the target feature vector of the character B to obtain the regular character corresponding to the character B.
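The last two steps (an attention operation between the decoder's previous hidden vector and the fifth feature vector, followed by splicing everything into a target feature vector) can be sketched numerically. The sigmoid-gated scaled dot product below is one simple stand-in for "an attention mechanism operation" between a single query and a single key; the claim does not specify the operation at this level of detail:

```python
import math

def sixth_feature_vector(prev_hidden, fifth_vec):
    """Score the decoder's previous hidden vector against the fifth
    feature vector, squash the score to (0, 1), and use it to
    weight the fifth feature vector."""
    score = sum(h * f for h, f in zip(prev_hidden, fifth_vec))
    score /= math.sqrt(len(fifth_vec))       # scaled dot product
    weight = 1.0 / (1.0 + math.exp(-score))  # attention weight
    return [weight * f for f in fifth_vec]

def target_feature_vector(pos_vec, third_vec, sixth_vec, prev_decoded):
    # 'Splicing' here is plain concatenation of the four vectors.
    return pos_vec + third_vec + sixth_vec + prev_decoded

sixth = sixth_feature_vector([0.1, 0.2], [0.3, 0.4])
target = target_feature_vector([1.0], [0.5, 0.5], sixth, [0.0])
print(len(target))  # 1 + 2 + 2 + 1 = 6
```

The concatenated target vector bundles part-of-speech, context (third), attention-weighted (sixth), and previous-output information into one decoder input per step.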
  • the transceiver 501 may be the acquisition unit 401 of the text regularization apparatus 400 of the embodiment shown in FIG. 4
  • the processor 502 may be the processing unit 402 of the text regularization apparatus 400 of the embodiment shown in FIG. 4 .
  • the text regularization device in this application may include smart phones (such as Android, iOS, or Windows Phone mobile phones), tablet computers, PDAs, notebook computers, mobile Internet devices (MIDs), wearable devices, and the like.
  • the above text regularization devices are only examples and are not exhaustive; the text regularization device includes but is not limited to the devices listed above.
  • the above-mentioned text regularization apparatus may further include: an intelligent vehicle-mounted terminal, a computer device, and the like.
  • Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement some or all of the steps of any one of the text regularization methods described in the foregoing method embodiments.
  • the storage medium involved in this application such as a computer-readable storage medium, may be non-volatile or volatile.
  • Embodiments of the present application further provide a computer program product, the computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute some or all of the steps of any one of the text regularization methods described in the foregoing method embodiments.
  • the disclosed apparatus may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative; for example, the division of the units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling, direct coupling, or communication connection may be implemented through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, and can also be implemented in the form of software program modules.
  • the integrated unit if implemented in the form of a software program module and sold or used as a stand-alone product, may be stored in a computer readable memory.
  • in essence, the technical solution of the present application, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, and the computer software product is stored in a memory.
  • a computer device which may be a personal computer, a server, or a network device, etc.
  • the aforementioned memory includes media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to the technical field of artificial intelligence, and in particular to a text regularization method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a text to be regularized; performing character segmentation on the text to be regularized to obtain a plurality of characters; encoding each of the plurality of characters to obtain a first feature vector of each of the plurality of characters, the first feature vector of each of the plurality of characters being used to represent context information of that character; and performing, according to the first feature vector of each of the plurality of characters and a language type of the text to be regularized, regularization processing on the text to be regularized to obtain a regular text of the text to be regularized. The present invention is advantageous for improving the efficiency and accuracy of text regularization.
PCT/CN2021/083493 2020-12-31 2021-03-29 Text regularization method and apparatus, electronic device, and storage medium WO2022141855A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011644545.8A CN112765937A (zh) Text regularization method and apparatus, electronic device, and storage medium
CN202011644545.8 2020-12-31

Publications (1)

Publication Number Publication Date
WO2022141855A1 true WO2022141855A1 (fr) 2022-07-07

Family

ID=75698776

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083493 WO2022141855A1 (fr) 2020-12-31 2021-03-29 Text regularization method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN112765937A (fr)
WO (1) WO2022141855A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114662499A (zh) * 2022-03-17 2022-06-24 平安科技(深圳)有限公司 Text-based emotion recognition method, apparatus, device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090281791A1 (en) * 2008-05-09 2009-11-12 Microsoft Corporation Unified tagging of tokens for text normalization
US20170116177A1 (en) * 2015-10-26 2017-04-27 24/7 Customer, Inc. Method and apparatus for facilitating customer intent prediction
CN107680579A (zh) * 2017-09-29 2018-02-09 百度在线网络技术(北京)有限公司 Text regularization model training method and apparatus, and text regularization method and apparatus
CN110765733A (zh) * 2019-10-24 2020-02-07 科大讯飞股份有限公司 Text normalization method, apparatus, device, and storage medium
CN111832248A (zh) * 2020-07-27 2020-10-27 科大讯飞股份有限公司 Text normalization method and apparatus, electronic device, and storage medium


Also Published As

Publication number Publication date
CN112765937A (zh) 2021-05-07

Similar Documents

Publication Publication Date Title
CN111444340B (zh) Text classification method, apparatus, device, and storage medium
US10650102B2 (en) Method and apparatus for generating parallel text in same language
CN113807098B (zh) Model training method and apparatus, electronic device, and storage medium
JP7301922B2 (ja) Semantic retrieval method, apparatus, electronic device, storage medium, and computer program
CN112528637B (zh) Text processing model training method and apparatus, computer device, and storage medium
CN112287069B (zh) Speech-semantics-based information retrieval method and apparatus, and computer device
WO2023138188A1 (fr) Feature fusion model training method and apparatus, sample retrieval method and apparatus, and computer device
CN113076739A (zh) Cross-domain Chinese text error correction method and system
WO2024098533A1 (fr) Bidirectional image-text retrieval method, apparatus and device, and non-volatile readable storage medium
WO2023040493A1 (fr) Event detection
CN115759119B (zh) Financial text sentiment analysis method, system, medium, and device
CN113743101A (zh) Text error correction method and apparatus, electronic device, and computer storage medium
JP2023002690A (ja) Semantic recognition method, apparatus, electronic device, and storage medium
JP2023062150A (ja) Character recognition model training, character recognition method, apparatus, device, and medium
CN114398943B (zh) Sample augmentation method and apparatus
CN113255331B (zh) Text error correction method, apparatus, and storage medium
CN110969005B (zh) Method and apparatus for determining similarity between entity corpora
WO2022141855A1 (fr) Text regularization method and apparatus, electronic device, and storage medium
CN113076744A (zh) Convolutional neural network-based cultural relic knowledge relation extraction method
CN116402166B (zh) Prediction model training method and apparatus, electronic device, and storage medium
CN115357710B (zh) Training method and apparatus for table description text generation model, and electronic device
CN112765330A (zh) Text data processing method and apparatus, electronic device, and storage medium
CN110516125A (zh) Method, apparatus and device for identifying abnormal character strings, and readable storage medium
WO2022073341A1 (fr) Speech-semantics-based disease entity matching method and apparatus, and computer device
CN112966501B (zh) New word discovery method, system, terminal, and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21912625

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21912625

Country of ref document: EP

Kind code of ref document: A1