CN117610508A - Text processing method, device and storage medium - Google Patents

Publication number
CN117610508A
Authority
CN
China
Prior art keywords
target
text
character string
word
letters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311607568.5A
Other languages
Chinese (zh)
Inventor
朱浩
孟彦伟
侯雷平
李丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Post Information Technology Beijing Co ltd
Original Assignee
China Post Information Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Post Information Technology Beijing Co ltd filed Critical China Post Information Technology Beijing Co ltd
Priority to CN202311607568.5A priority Critical patent/CN117610508A/en
Publication of CN117610508A publication Critical patent/CN117610508A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text processing method, a text processing device and a storage medium. The method comprises the following steps: acquiring a text to be detected, and determining a target character string in the text to be detected, wherein the target character string comprises a character string composed of target letters, and the target letters at least comprise pinyin letters and/or English letters; performing content reconstruction on the target character string based on a pre-established word stock to obtain a reconstructed character string, and updating the target character string in the text to be detected based on the reconstructed character string, wherein the word stock is used for querying word character strings corresponding to words that a plurality of target letters can form; and performing language conversion on the updated text to be detected to obtain a target text. The method addresses the technical problem that efficient and accurate language conversion is difficult to perform on a text to be processed whose content is abnormal. It yields standardized text content and thereby improves the efficiency and accuracy of language conversion of the text.

Description

Text processing method, device and storage medium
Technical Field
The present invention relates to the field of text processing technologies, and in particular, to a text processing method, apparatus, and storage medium.
Background
Text in certain application scenarios is prone to writing-convention problems. For example, text in the address field of an international waybill is prone to anomalies such as mixed languages, misspelled content, missing separators between words, and letters run together.
When a language conversion tool is applied to such text, the non-standard content prevents it from achieving good results. Conventional manual processing, in turn, tends to be inefficient, costly, and imprecise.
Disclosure of Invention
The invention provides a text processing method, a text processing device and a storage medium, which are used for solving the technical problem that in the prior art, aiming at a text to be processed with abnormal content, efficient and accurate language conversion cannot be performed on the text.
According to an aspect of the present invention, there is provided a text processing method including:
acquiring a text to be detected, and determining a target character string in the text to be detected, wherein the target character string comprises a character string composed of target letters, and the target letters at least comprise pinyin letters and/or English letters;
performing content reconstruction on the target character string based on a pre-established word stock to obtain a reconstructed character string, and updating the target character string in the text to be detected based on the reconstructed character string, wherein the word stock is used for inquiring word character strings corresponding to words which can be formed by a plurality of target letters;
and performing language conversion on the updated text to be detected to obtain a target text.
According to another aspect of the present invention, there is provided a text processing apparatus including:
the text acquisition module is used for acquiring a text to be detected and determining a target character string in the text to be detected, wherein the target character string comprises a character string composed of target letters, and the target letters at least comprise pinyin letters and/or English letters;
the content reconstruction module is used for carrying out content reconstruction on the target character string based on a pre-established word stock to obtain a reconstructed character string, and updating the target character string in the text to be detected based on the reconstructed character string, wherein the word stock is used for inquiring word character strings corresponding to words which can be formed by a plurality of target letters;
and the text conversion module is used for carrying out language conversion on the updated text to be detected to obtain a target text.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the text processing method of any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to execute a text processing method according to any one of the embodiments of the present invention.
According to the technical solution, the text to be detected is first obtained and the target character string in it is determined, wherein the target character string comprises a character string composed of target letters, and the target letters at least comprise pinyin letters and/or English letters; the text content is thereby deconstructed into character strings. Content reconstruction is then performed on the target character string based on a pre-established word stock to obtain a reconstructed character string, and the target character string in the text to be detected is updated based on the reconstructed character string, wherein the word stock is used for querying word character strings corresponding to words that a plurality of target letters can form; the text with abnormal content is thereby reconstructed into a standardized text that is convenient for language conversion. Finally, language conversion is performed on the updated text to be detected to obtain a target text. The method addresses the technical problem that efficient and accurate language conversion is difficult to perform on a text to be processed whose content is abnormal. It yields standardized text content and thereby improves the efficiency and accuracy of language conversion of the text.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a text processing method according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of a word stock structure according to a first embodiment of the present invention;
FIG. 3 is a flow chart of a text processing method according to a second embodiment of the present invention;
FIG. 4 is a schematic structural view of a text processing device according to a third embodiment of the present invention;
FIG. 5 is a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and "object" in the description of the present invention and the claims and the above drawings are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a text processing method according to an embodiment of the present invention, where the method may be performed by a text processing device, and the text processing device may be implemented in hardware and/or software, and the text processing device may be configured in an electronic device. As shown in fig. 1, the method includes:
S110, acquiring a text to be detected, and determining a target character string in the text to be detected, wherein the target character string comprises a character string composed of target letters, and the target letters at least comprise pinyin letters and/or English letters.
In this embodiment, the text to be detected may be a text whose content may contain anomalies and which awaits detection and subsequent processing of that anomalous content. For example, it may be the text, written mainly in English, on an international express waybill sent to China. Such text is written by spelling out the actual Chinese address of each place, so problems such as irregular formatting, misspellings, and mixed-in Chinese characters occur easily.
The target character string may be extracted from the text to be detected. It is a character string formed by the English letters of an English word in the text or by the pinyin letters of a Chinese pinyin syllable. Extraction may follow the way the original text is written: English words and Chinese pinyin delimited by spaces, punctuation and the like are extracted as character strings composed of letters.
For example, for the following text to be detected, which mixes English words and Chinese pinyin, English word strings such as "rom" (a misspelling of "room") and "building" and Chinese pinyin strings such as "xing" and "fu" may be extracted as target character strings.
Examples: rom 201,building No.3,No.12,xing fu road,Wen Ming District
It can be understood that content that merely indicates a street number or building number, such as "No.3" in the text to be detected, need not be extracted as a target character string even though it contains English letters.
Optionally, determining the target character string in the text to be detected includes: determining the target character string in the text to be detected according to a target symbol and/or a preset character string length in the text to be detected.
In this embodiment, the target symbol may be a space or punctuation mark as in the foregoing embodiment, or any other symbol that separates English words or Chinese pinyin. The preset character string length may be a maximum length threshold for English word strings or Chinese pinyin strings, set in advance according to actual conditions.
It can be appreciated that, in extreme cases, letters in the text to be detected may run together into long letter sequences for various reasons; parsing an entire such sequence as a single target character string would reduce parsing efficiency. Therefore, for long run-together pinyin sequences, the maximum length of a single pinyin syllable, for example 6, may be used as the preset character string length. For long run-together English sequences, the maximum length of a single English word, for example 20, may be used as the preset character string length.
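The extraction just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in the patent: the function name `extract_target_strings`, the regular expression, and the choice to cut over-long letter runs into consecutive pieces of at most `max_len` letters are all hypothetical.

```python
import re

# Letter runs bounded by non-word characters. The negative lookahead skips
# numbering abbreviations such as "No.3" (an assumption of this sketch).
_LETTER_RUN = re.compile(r"\b[A-Za-z]+\b(?!\.\s*\d)")


def extract_target_strings(text, max_len=20):
    """Return the letter-only strings of `text`, each at most `max_len` long."""
    targets = []
    for match in _LETTER_RUN.finditer(text):
        run = match.group(0)
        # Cut run-together sequences into pieces no longer than max_len.
        for start in range(0, len(run), max_len):
            targets.append(run[start:start + max_len])
    return targets
```

Applied to the sample address above, the sketch returns `rom`, `building`, `xing`, `fu`, `road`, `Wen`, `Ming` and `District`, and leaves out the letters of "No.3" and "No.12".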
S120, performing content reconstruction on the target character string based on a pre-established word stock to obtain a reconstructed character string, and updating the target character string in the text to be detected based on the reconstructed character string, wherein the word stock is used for inquiring word character strings corresponding to words which can be formed by a plurality of target letters.
In this embodiment, the pre-established word stock may be a comprehensive word stock, built in advance according to actual conditions, that contains an English word stock and a Chinese pinyin stock. Reconstructing the content of the target character string based on the word stock may consist of querying and correcting the spelling of the target character string against the standard English word spellings and Chinese pinyin spellings in the word stock, in combination with algorithms such as spelling correction. Once the spelling has been queried and corrected, the content-reconstructed target character string is obtained and serves as the reconstructed character string. Updating the target character string in the text to be detected based on the reconstructed character string may consist of replacing the original target character string with the reconstructed character string.
The target letters may be the English letters or Chinese pinyin letters contained in the target character string. The word character string corresponding to a word that a plurality of target letters can form may be the English word string or Chinese pinyin string, stored in the pre-established word stock, of an English word or Chinese pinyin that those letters can spell.
To query the word stock for such word character strings, the target letters may be used, in their original order, to traverse the word character nodes of the word stock level by level. When an English word or Chinese pinyin matching the target letters is found, the matched word character string may be taken as the query result. It can be understood that even when the target letters contain errors or are ambiguous, word character strings corresponding to words the letters can form may still be found in the word stock.
For example, when the target character string is the misspelling "roomb", the word character string of the English word "room" may be found in the word stock based on the target letters r, o and m. When the target character string is "xian", both the pinyin string "xian" and the pinyin strings "xi" and "an" may be found in the word stock based on the target letters x, i, a and n. In this case, which word character string to use as the reconstructed character string may be decided by comparing the probabilities with which the alternatives occur in the text to be detected.
Specifically, the target character string can be queried based on a pre-established word stock, and then the content of the target character string is reconstructed based on the queried word character string to obtain a reconstructed character string. And finally updating the target character string in the text to be detected based on the reconstructed character string.
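The final update step can be illustrated with a short sketch. It assumes the reconstruction has already produced a mapping from original target character strings to reconstructed character strings; the function name `update_text` and the mapping `reconstructed` are hypothetical names introduced for the example.

```python
import re


def update_text(text, reconstructed):
    """Replace each letter run that has a reconstructed form; keep the rest."""
    def swap(match):
        word = match.group(0)
        # Look the string up case-insensitively; unknown strings are unchanged.
        return reconstructed.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", swap, text)
```

For instance, `update_text("rom 201,building No.3", {"rom": "room"})` yields `"room 201,building No.3"`; digits, punctuation and strings without a reconstructed form are left as written.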
In one embodiment, a word-level Chinese-English hybrid deconstruction algorithm may be used to reconstruct the structure of the target character string. The algorithm consists of two parts. On the one hand, a mixed Chinese-English word stock based on the 26 letters shared by Chinese pinyin and English may be constructed. Throughout library construction, Chinese pinyin letters and English letters may be combined. The maximum length of a standard pinyin string may be set to 6, covering 6000 common Chinese characters; the maximum length of an English word may be set to 20, and different English tenses may be merged into the word stock, so as to provide the parsing capability needed for content query and reconstruction. The word stock structure may be as shown in FIG. 2. To improve overall construction and query efficiency, the maximum depth of the tree is set to twenty in the overall design, matching the maximum query length. In addition, each query node may carry multiple attributes (for example, whether the node has been traversed and whether it ends a traversal) to support traversal of the word stock.
On the other hand, while the algorithm parses content, after traversing to the current node it may simultaneously verify whether the queried pinyin or English word is complete, and may determine, according to the actual algorithm logic, whether all query results have been exhausted. If the query fails at some level, the required field does not exist in the word stock; a spelling problem can then be assumed, and inference and verification are carried out by the algorithm. During the query, analysis and judgment proceed in stages according to the following formulas:
where node(x_i) may denote traversing the node of the i-th target letter of the target character string; K may be a word character string obtained by traversing a plurality of target letters; Y may be a character string group made up of a plurality of word character strings; and node(x_ij) may denote re-traversing the 26 English letters for the i-th target letter of the target character string.
As the formulas show, during content parsing the word stock nodes may be visited starting from the first letter using formulas (1) and (3), and the attribute of each node indicates whether it is an end node. If so, normal parsing may continue; if not, an abnormal matching point is deemed to have occurred and 0 may be returned directly. Error correction may then be performed using formula (2): if the current node n=s has no matchable value, the 26 letters of the current level are traversed in turn, after which the structure of the content to be translated can be reconstructed according to the processing logic of formulas (1) and (2).
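A letter tree of the kind described for the word stock can be sketched as below. This is only an illustration of the general idea (one node per letter, an end-node attribute, a depth limit of twenty, and a lookup that returns 0 on a mismatch); the names `Node`, `build_tree` and `lookup` are hypothetical, and the sketch does not reproduce formulas (1) to (3).

```python
class Node:
    """One letter node of the word query tree."""

    def __init__(self):
        self.children = {}    # letter -> Node on the next level
        self.is_end = False   # True when the path from the root spells an entry


def build_tree(words, max_depth=20):
    """Insert English words and pinyin strings into one shared letter tree."""
    root = Node()
    for word in words:
        word = word.lower()
        if not word.isalpha() or len(word) > max_depth:
            continue  # skip entries that do not fit the tree design
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.is_end = True
    return root


def lookup(root, letters):
    """Return 1 if the letters spell a complete entry, otherwise 0."""
    node = root
    for ch in letters.lower():
        node = node.children.get(ch)
        if node is None:
            return 0  # abnormal matching point: no node for this letter
    return 1 if node.is_end else 0
```

With a tree built from `["room", "building", "xing", "fu"]`, `lookup` returns 1 for "room" and 0 for both the incomplete "roo" and the misspelled "roomb".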
S130, performing language conversion on the updated text to be detected to obtain a target text.
In this embodiment, the language conversion of the updated text to be detected may be performed manually, or may be performed based on a pre-trained translation model or by using a translation tool software, etc. The language-converted text may then be used as the target text.
Optionally, performing language conversion on the updated text to be detected to obtain a target text, including: and inputting the updated text to be detected into a pre-trained text translation large model for language conversion to obtain a target text, wherein the text translation large model is obtained by training a neural network model based on a sample text consisting of at least one language and an expected text corresponding to the sample text.
In this embodiment, the text translation large model may be a neural-network-based model, for example a large model such as Baichuan or ChatGLM. The neural network model may be trained on sample texts composed of at least one language, such as Chinese pinyin and English, together with the expected texts corresponding to those sample texts, yielding the pre-trained text translation large model. Large models such as Baichuan and ChatGLM are usually trained on large amounts of data, so they have strong feature extraction capability and can handle a variety of complex tasks effectively. Training and adjusting the neural network model on sample texts adapted to the specific business scenario, together with their expected texts, brings the model's predictions closer to the true labels and further improves translation accuracy. Because the content of the text to be detected is reconstructed first, this embodiment of the invention addresses the technical problem of inaccurate machine translation results caused by content errors. The content-reconstructed text to be detected can then be converted automatically by a language conversion tool, which greatly improves the efficiency and accuracy of language conversion and saves labor cost in the conversion work.
According to the technical solution, a text to be detected is first obtained, and the target character strings in it, including English words and Chinese pinyin, are determined. Content reconstruction is then performed on the target character string based on a pre-established word stock to obtain a reconstructed character string, and the target character string in the text to be detected is updated based on the reconstructed character string, thereby correcting English word spelling and Chinese pinyin spelling. Finally, a pre-trained neural network model performs language conversion on the updated text to be detected to obtain a target text. The method addresses the technical problem that efficient and accurate language conversion is difficult to perform on a text to be processed whose content is abnormal. It yields standardized text content, and the neural-network-based text translation further improves the efficiency and accuracy of language conversion of the text.
Example two
FIG. 3 is a flowchart of a text processing method according to a second embodiment of the present invention. Building on the above embodiments, this embodiment describes in detail how the reconstructed character string is determined based on the query result; see the description below for the specific implementation. Technical features that are the same as or similar to those of the foregoing embodiments are not repeated here. As shown in FIG. 3, the method includes:
S210, acquiring a text to be detected, and determining a target character string in the text to be detected.
S220, inquiring in a pre-established word stock according to the arrangement sequence of each target letter in the target character string, and determining a reconstruction character string based on an inquiry result, wherein a word inquiry tree is stored in the word stock, the word inquiry tree comprises a plurality of levels, each level comprises a plurality of nodes, at least part of nodes of different levels are connected with edges, the nodes are preset letters, and the edges between the nodes are determined based on words which can be formed by the preset letters.
In this embodiment, the arrangement order of the target letters may be determined from the writing order in the text to be detected, that is, the order of the letters within each target character string. The word query tree may be a query tree of English words or Chinese pinyin stored in the word stock. The nodes of the query tree may be preset letters, for example the 26 English letters, which serve both English word queries and Chinese pinyin queries. Each level of the word query tree may include nodes for the 26 English letters, so that a misspelled target letter can be corrected during the query.
Specifically, querying the pre-established word stock according to the arrangement order of the target letters may proceed as follows. Starting from the first letter, the nodes of the first level of the word query tree are traversed to find the matching node. The nodes of the second level are then traversed for the second letter to find the matching letter, and so on until all target letters have been traversed. When all target letters match a word in the word query tree, the last matched node may be marked as an end node, the query is complete, and the query result is obtained. When, because of a writing error, a target letter has no matching node at its level, the other 25 English letters at that level may be traversed, and the query continues with the next letter from every letter node that matches, until the remaining target letters have been traversed and a query result is obtained. The reconstructed character string may be determined based on the word character string corresponding to the query result.
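The level-by-level query with fallback to the other letters of a level can be sketched as follows. The sketch makes several assumptions that are not in the source: the tree is a nested dictionary, the key `END` marks an end node, only letter substitutions are handled (not extra or missing letters), and the parameter `max_subs` caps the number of substituted letters. All names are hypothetical.

```python
import string

END = "$"  # hypothetical marker key: an entry ends at this node


def build_tree(words):
    """Build a nested-dict letter tree from the given entries."""
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def query(root, letters, max_subs=1):
    """Return entries of the same length, allowing at most max_subs substitutions."""
    letters = letters.lower()
    results = []

    def walk(node, i, prefix, subs):
        if i == len(letters):
            if END in node:          # only complete entries count
                results.append(prefix)
            return
        ch = letters[i]
        if ch in node:               # normal match on this level
            walk(node[ch], i + 1, prefix + ch, subs)
        elif subs < max_subs:        # no match: try the other 25 letters
            for other in string.ascii_lowercase:
                if other != ch and other in node:
                    walk(node[other], i + 1, prefix + other, subs + 1)

    walk(root, 0, "", 0)
    return results
```

With a tree built from "room", "root", "roof" and "road", querying "rooz" returns `["roof", "room", "root"]` (in alphabetical order of the substituted letter), while querying "room" returns only `["room"]`.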
Optionally, the query is performed in a pre-established word stock according to the arrangement sequence of each target letter in the target character string, including: and searching step by step in the word stock according to the arrangement sequence of each target letter in the target character string, traversing the word stock to search out all candidate character strings corresponding to the target character string, wherein the candidate character strings comprise word character strings and/or invalid character strings, and the invalid character strings are character strings which do not form words.
In this embodiment, the candidate character strings corresponding to the target character string may be the candidate result strings, among the query results for the target character string, that are available for selection. An invalid character string may be a character string that cannot be determined to be a valid word or valid pinyin based on the English words and Chinese pinyin in the word stock. Querying the word stock level by level according to the arrangement order of the target letters means that each target letter, in order, corresponds to one level of the word query tree, and the target letters are then traversed level by level in the word query tree.
For example, when the target character string is "room", its target letters correspond to the word character string "room" in the word stock, so the candidate character string "room" may be found.
As another example, when the target character string is "rooz", its target letters do not correspond to any word character string in the word stock; by re-traversing the last target letter z, several candidate character strings may be found, for example the word character strings "room", "root" and "roof".
When the target character string is "furong", it cannot be determined whether it is a misspelled English word or run-together Chinese pinyin, so more than one set of candidate character strings may be found in the word stock. For example, the candidates may be the two word character strings formed by the pinyin "fu" and the pinyin "rong", or they may be the English word strings "fur" and "on" together with the invalid character string "g".
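How a run-together string such as "furong" can give rise to several sets of candidate character strings may be sketched as follows. For brevity this illustration uses a plain set as the word stock instead of the word query tree, and it treats a single unmatched letter as an invalid character string; both choices, and the name `candidate_groups`, are assumptions of the sketch.

```python
def candidate_groups(target, lexicon, max_len=20):
    """Enumerate ways to split `target` into (string, is_valid) pieces."""
    target = target.lower()
    groups = []

    def split(i, pieces):
        if i == len(target):
            groups.append(pieces)
            return
        matched = False
        # Try every entry of the word stock that starts at position i.
        for j in range(i + 1, min(len(target), i + max_len) + 1):
            if target[i:j] in lexicon:
                matched = True
                split(j, pieces + [(target[i:j], True)])
        if not matched:
            # No entry starts here: keep the letter as an invalid string.
            split(i + 1, pieces + [(target[i], False)])

    split(0, [])
    return groups
```

With the hypothetical word stock `{"fu", "rong", "fur", "on"}`, the string "furong" yields two groups: `fu` + `rong`, with no invalid string, and `fur` + `on` + `g`, where `g` is flagged as invalid.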
Optionally, determining the reconstructed character string based on the query result includes: forming a candidate group from one or more candidate character strings that are sequentially associated in the query result; and determining a target group among the candidate groups, and taking the candidate character strings in the target group as the reconstructed character string.
In this embodiment, the one or more candidate character strings associated in the query result may be the one or more candidate character strings obtained for a single target character string. The sequentially associated candidate character strings may be formed into a candidate group, which is used to determine the final query result for the target character string. The target group may be the candidate group selected as the final query result. For example, when the target character string is "furong", the two candidate groups {fur, on, g} and {fu, rong} may be determined from the candidate character strings. The group {fu, rong} may then be taken as the target group based on the position of the target character string in the text to be detected or on another preset determination mechanism.
Optionally, determining the target group in the candidate group includes: in the case where the number of candidate groups is one, the candidate group is taken as a target group; in the case where the number of candidate groups is plural, a target group is determined based on the invalid character string and/or the word frequency corresponding to the candidate character string in each candidate group.
In this embodiment, determining the target group based on the invalid character strings in each candidate group may mean comparing the number of invalid character strings across the groups. For example, for the two candidate groups { fur, on, g } and { fu, rong } determined in the foregoing embodiment, { fu, rong } contains fewer invalid character strings than { fur, on, g }, so { fu, rong } can be determined as the target group. This discrimination may use the minimum parsing invalidity F, with the following formula:
C=Min(F(Y))
where C may be the target group and F(Y) may be the invalid-character-string count computed for candidate group Y. The invalid strings in each candidate group are counted and compared, the parsing result with the fewest invalid strings is selected, and the reconstruction of the content structure to be translated is thereby completed.
Determining the target group based on the word frequency corresponding to the candidate character strings in each candidate group may mean determining the target group based on how often the candidate character strings in the group occur in a preset word stock. Taking the text in an international postal order as an example, a word stock can be built in advance from a large number of real, valid address texts in the application scenario, and the target group can then be determined according to the word frequency of each group's candidate character strings in that word stock. For example, when the target character string is "roov", three candidate groups { room }, { root } and { roof } may be determined, and { room } may be determined as the target group because room has the highest word frequency in the international postal order address word stock.
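As a minimal sketch of the formula C=Min(F(Y)), the function below takes F(Y) to be the number of strings in group Y that are not in the word stock. The names `invalid_count` and `pick_min_invalid`, and the representation of the word stock as a set, are assumptions made for illustration.

```python
def invalid_count(group, lexicon):
    """F(Y): number of strings in the group that are not words in the lexicon."""
    return sum(1 for s in group if s not in lexicon)

def pick_min_invalid(groups, lexicon):
    """C = Min(F(Y)): the group with the fewest invalid strings (first one on ties)."""
    return min(groups, key=lambda g: invalid_count(g, lexicon))

lexicon = {"fu", "rong", "fur", "on"}  # toy word stock (assumption)
groups = [["fur", "on", "g"], ["fu", "rong"]]
print(pick_min_invalid(groups, lexicon))
```

Here { fur, on, g } has one invalid string and { fu, rong } has none, so the latter is selected. Ties are resolved by list order in this sketch, whereas the description resolves them by word frequency.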
Optionally, determining the target group based on the invalid character strings in each candidate group and/or word frequencies corresponding to the candidate character strings includes: determining a candidate group with the least invalid character strings as a reference group; in the case where the number of reference groups is plural, a target group is determined based on the word frequency corresponding to the word string in the reference group.
In this embodiment, the reference group may be the candidate group used as the basis for comparison when the target group is determined from a plurality of candidate groups based on the invalid character strings in each group. It will be appreciated that when there is only one reference group, that reference group may be taken as the target group. When there are several reference groups, the target group can be determined from the word frequencies corresponding to the word strings in the reference groups.
For example, when the target character string is "sunan", it may be determined that it corresponds to two candidate groups { su, nan } and { sun, an }, both consisting of pinyin word strings and containing the same number of invalid strings. Because su nan (place name: southwest) appears in the international postal order address word stock with a higher word frequency than sun an, { su, nan } can be determined as the target group.
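The two-stage selection (fewest invalid strings first, then word frequency) might look like the sketch below. It assumes a group's frequency score is the sum of its words' frequencies, which the description does not specify. The frequency values are made-up placeholders rather than real corpus counts, and `select_target_group` is a hypothetical name.

```python
def select_target_group(groups, lexicon, freq):
    """Pick the target group: fewest invalid strings, then highest word frequency."""
    def invalid(group):
        return sum(1 for s in group if s not in lexicon)

    fewest = min(invalid(g) for g in groups)
    reference = [g for g in groups if invalid(g) == fewest]
    if len(reference) == 1:
        return reference[0]
    # Several reference groups remain: break the tie by summed word frequency.
    return max(reference, key=lambda g: sum(freq.get(w, 0) for w in g))

lexicon = {"su", "nan", "sun", "an"}
# Placeholder frequencies for illustration only (not measured data).
freq = {"su": 40, "nan": 55, "sun": 12, "an": 30}
print(select_target_group([["sun", "an"], ["su", "nan"]], lexicon, freq))
```

Other scoring choices, such as looking up the frequency of the whole phrase "su nan", would fit the description equally well.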
S230, updating the target character string in the text to be detected based on the reconstructed character string.
S240, performing language conversion on the updated text to be detected to obtain a target text.
According to this technical solution, a text to be detected is first obtained, and the target character strings in it, including English words and Chinese pinyin, are determined. The pre-established word stock is then queried according to the order of the target letters in each target character string; a reconstruction string is determined based on the query result; and the target character string in the text to be detected is updated based on the reconstructed character string, thereby correcting English word spelling and Chinese pinyin spelling. Finally, a pre-trained neural network model performs language conversion on the updated text to be detected to obtain the target text. This addresses the technical problem that efficient and accurate language conversion is difficult to perform on a text to be processed whose content is abnormal. Standardized text content is obtained, and neural-network-based text translation further improves the efficiency and accuracy of the language conversion.
Example III
Fig. 3 is a schematic structural diagram of a text processing device according to a third embodiment of the present invention. As shown in fig. 3, the apparatus includes: a text acquisition module 310, a content reconstruction module 320, and a text conversion module 330.
The text acquisition module 310 is configured to obtain a text to be detected, and determine a target character string in the text to be detected, where the target character string includes a character string composed of target letters, and the target letters include at least pinyin letters and/or English letters; the content reconstruction module 320 is configured to perform content reconstruction on the target character string based on a pre-established word stock to obtain a reconstructed character string, and update the target character string in the text to be detected based on the reconstructed character string, where the word stock is used to query word character strings corresponding to words that can be formed by a plurality of target letters; the text conversion module 330 is configured to perform language conversion on the updated text to be detected to obtain a target text.
According to this technical solution, the text to be detected is first obtained and the target character string in it is determined, where the target character string includes a character string composed of target letters, and the target letters include at least pinyin letters and/or English letters, so that the text content is deconstructed on the basis of character strings. Content reconstruction is then performed on the target character string based on a pre-established word stock to obtain a reconstructed character string, and the target character string in the text to be detected is updated based on the reconstructed character string, where the word stock is used to query word character strings corresponding to words that can be formed by a plurality of target letters, so that a text with abnormal content is reconstructed into a standardized text that is convenient for language conversion. Finally, language conversion is performed on the updated text to be detected to obtain a target text. This addresses the technical problem that efficient and accurate language conversion is difficult to perform on a text to be processed whose content is abnormal. Standardized text content is obtained, which further improves the efficiency and accuracy of the language conversion.
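The division of work among the three modules can be sketched as a single function with two pluggable callables. This is an illustration under stated assumptions: target strings are taken to be runs of ASCII letters, `reconstruct` stands in for the content reconstruction module, and `translate` stands in for the pre-trained translation model. None of these names come from the patent.

```python
import re

def process_text(text, reconstruct, translate):
    """Acquire target strings, reconstruct them in place, then convert language."""
    def fix(match):
        # Replace each target string with its reconstructed words, space-separated.
        return " ".join(reconstruct(match.group(0)))

    updated = re.sub(r"[A-Za-z]+", fix, text)  # acquisition plus update
    return translate(updated)                  # language conversion

# Stand-ins for the real modules (assumptions for demonstration only).
def toy_reconstruct(s):
    return ["fu", "rong"] if s == "furong" else [s]

print(process_text("furong road 12", toy_reconstruct, str.upper))
```

In practice `translate` would call the neural translation model; `str.upper` is used here only so the sketch runs without any external dependency.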
Based on the above technical solution, further, the content reconstruction module 320 includes a character string reconstruction unit.
The character string reconstruction unit is used for inquiring in a pre-established word stock according to the arrangement sequence of each target letter in the target character string, and determining a reconstructed character string based on an inquiry result, wherein a word inquiry tree is stored in the word stock, the word inquiry tree comprises a plurality of layers, each layer comprises a plurality of nodes, at least part of nodes of different layers are connected with edges, the nodes are preset letters, and the edges between the nodes are determined based on words which can be formed by the preset letters.
Based on the technical scheme, the character string reconstruction unit is further used for carrying out step-by-step inquiry in the word stock according to the arrangement sequence of each target letter in the target character string, traversing the word stock to inquire out all candidate character strings corresponding to the target character string, wherein the candidate character strings comprise word character strings and/or invalid character strings, and the invalid character strings are character strings which do not form words.
Based on the above technical solution, the character string reconstruction unit is further specifically configured to form a candidate group from one or more candidate character strings that follow one another in the query result; and to determine a target group among the candidate groups, and take the candidate character strings in the target group as the reconstruction character string.
Based on the above technical solution, further, the character string reconstruction unit is specifically configured to take the candidate group as the target group when the number of candidate groups is one; in the case where the number of candidate groups is plural, a target group is determined based on the invalid character string and/or the word frequency corresponding to the candidate character string in each candidate group.
Based on the above technical solution, the character string reconstruction unit is further specifically configured to determine the candidate group with the fewest invalid character strings as a reference group; and, in the case where there are a plurality of reference groups, to determine a target group based on the word frequency corresponding to the word strings in the reference groups.
Based on the above technical solution, the text conversion module 330 is further specifically configured to input the updated text to be detected into a pre-trained text translation large model for language conversion, so as to obtain the target text, where the text translation large model is obtained by training the neural network model based on a sample text composed of at least one language and an expected text corresponding to the sample text.
Based on the above technical solution, the text acquisition module 310 is further specifically configured to determine the target character string in the text to be detected according to the target symbol and/or the preset character string length in the text to be detected.
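As a rough illustration of locating target character strings by symbol and length, the sketch below reads "target symbol" as a delimiter character and "preset character string length" as a minimum token length. Both readings, the default delimiter set, and the name `find_target_strings` are assumptions; the description here does not define them.

```python
import re

def find_target_strings(text, symbols=" ,;/", min_len=2):
    """Split on the assumed delimiter symbols and keep letter-only tokens of sufficient length."""
    pattern = "[" + re.escape(symbols) + "]+"
    tokens = re.split(pattern, text)
    return [
        t for t in tokens
        if t.isascii() and t.isalpha() and len(t) >= min_len
    ]

print(find_target_strings("No.5, furong road;Beijing"))
```

Tokens containing digits or punctuation, such as "No.5", are skipped because they are not composed purely of letters.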
The text processing device provided by the embodiment of the invention can execute the text processing method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example IV
Fig. 4 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the text processing method.
In some embodiments, the text processing method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the text processing method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the text processing method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and no limitation is imposed herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A text processing method, comprising:
acquiring a text to be detected, and determining a target character string in the text to be detected, wherein the target character string comprises a character string composed of target letters, and the target letters at least comprise pinyin letters and/or English letters;
performing content reconstruction on the target character string based on a pre-established word stock to obtain a reconstructed character string, and updating the target character string in the text to be detected based on the reconstructed character string, wherein the word stock is used for inquiring word character strings corresponding to words which can be formed by a plurality of target letters;
and performing language conversion on the updated text to be detected to obtain a target text.
2. The method of claim 1, wherein performing content reconstruction on the target string based on a pre-established word stock to obtain a reconstructed string comprises:
inquiring in a pre-established word stock according to the arrangement sequence of each target letter in the target character string, and determining a reconstruction character string based on an inquiry result, wherein a word inquiry tree is stored in the word stock, the word inquiry tree comprises a plurality of layers, each layer comprises a plurality of nodes, at least part of nodes of different layers are connected with edges, the nodes are preset letters, and the edges between the nodes are determined based on words which can be formed by the preset letters.
3. The method according to claim 2, wherein the querying in a pre-established word stock according to the arrangement order of each target letter in the target character string includes:
and carrying out step-by-step inquiry in the word stock according to the arrangement sequence of each target letter in the target character string, traversing the word stock to inquire out all candidate character strings corresponding to the target character string, wherein the candidate character strings comprise word character strings and/or invalid character strings, and the invalid character strings are character strings which do not form words.
4. The method of claim 3, wherein the determining a reconstruction string based on the query results comprises:
forming a candidate group by one or more candidate character strings which are associated with the query result in front and behind;
and determining a target group in the candidate group, and taking the candidate character strings in the target group as reconstruction character strings.
5. The method of claim 4, wherein said determining a target set of said candidate sets comprises:
in the case where the number of candidate groups is one, the candidate group is taken as a target group;
and determining a target group based on the invalid character strings in each candidate group and/or the word frequency corresponding to the candidate character strings under the condition that the number of the candidate groups is a plurality of.
6. The method according to claim 5, wherein the determining the target group based on the invalid character string in each of the candidate groups and/or the word frequency corresponding to the candidate character string includes:
determining the candidate group with the least invalid character string as a reference group;
and determining a target group based on the word frequency corresponding to the word string in the reference group when the number of the reference groups is a plurality of.
7. The method of claim 1, wherein the performing language conversion on the updated text to be detected to obtain a target text includes:
and inputting the updated text to be detected into a pre-trained text translation large model for language conversion to obtain a target text, wherein the text translation large model is obtained by training a neural network model based on a sample text consisting of at least one language and an expected text corresponding to the sample text.
8. The method of claim 1, wherein the determining the target string in the text to be detected comprises:
and determining the target character string in the text to be detected according to the target symbol and/or the preset character string length in the text to be detected.
9. A text processing apparatus, comprising:
the text acquisition module is used for acquiring a text to be detected and determining a target character string in the text to be detected, wherein the target character string comprises a character string composed of target letters, and the target letters at least comprise pinyin letters and/or English letters;
the content reconstruction module is used for carrying out content reconstruction on the target character string based on a pre-established word stock to obtain a reconstructed character string, and updating the target character string in the text to be detected based on the reconstructed character string, wherein the word stock is used for inquiring word character strings corresponding to words which can be formed by a plurality of target letters;
and the text conversion module is used for carrying out language conversion on the updated text to be detected to obtain a target text.
10. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the text processing method of any of claims 1-7.
CN202311607568.5A 2023-11-28 2023-11-28 Text processing method, device and storage medium Pending CN117610508A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311607568.5A CN117610508A (en) 2023-11-28 2023-11-28 Text processing method, device and storage medium

Publications (1)

Publication Number Publication Date
CN117610508A true CN117610508A (en) 2024-02-27

Family

ID=89959284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311607568.5A Pending CN117610508A (en) 2023-11-28 2023-11-28 Text processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN117610508A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination