CN112257420A - Text processing method and device - Google Patents

Info

Publication number
CN112257420A
Authority
CN
China
Prior art keywords
text
pinyin
phrase
polyphone
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011133952.2A
Other languages
Chinese (zh)
Other versions
CN112257420B (en
Inventor
蒋荣正
夏龙
马楠
杨明祺
郭常圳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ape Power Future Technology Co Ltd
Original Assignee
Beijing Ape Power Future Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ape Power Future Technology Co Ltd filed Critical Beijing Ape Power Future Technology Co Ltd
Priority to CN202011133952.2A priority Critical patent/CN112257420B/en
Publication of CN112257420A publication Critical patent/CN112257420A/en
Application granted granted Critical
Publication of CN112257420B publication Critical patent/CN112257420B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present specification provides a text processing method and apparatus. The text processing method includes: acquiring an initial text carrying a polyphone identifier, wherein the initial text comprises at least one polyphone; determining an ith pinyin sequence corresponding to the initial text, and constructing at least one meta-phrase containing the polyphone according to the polyphone identifier and the initial text, wherein i is a positive integer starting from 1; determining a phrase pinyin sequence of the meta-phrase according to the ith pinyin sequence, and inputting the phrase pinyin sequence to a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence; in a case where the meta-phrase is inconsistent with the reference phrase, incrementing i by 1 and returning to the step of determining the ith pinyin sequence corresponding to the initial text; and in a case where the meta-phrase is consistent with the reference phrase, creating a text pinyin group based on the polyphone identifier, the initial text and the ith pinyin sequence, and writing the text pinyin group into a polyphone text library.

Description

Text processing method and device
Technical Field
The present disclosure relates to the field of text processing technologies, and in particular, to a text processing method and apparatus.
Background
With the development of internet technology, more and more application scenarios place increasingly high requirements on the quantity and quality of data, and the data used differs from scenario to scenario. In the field of machine learning, different models are constructed for different usage requirements, and different models need to be trained with different sample data. For example, in an image processing scenario the model needs to be trained with image data; in an audio processing scenario the model needs to be trained with audio data; and in a text processing scenario the model needs to be trained with text data. To train a model that meets the usage requirements, the sample data must be preprocessed in the data preparation stage, for example by labeling the data and constructing sample pairs. These preparation operations directly affect the accuracy of the trained model. In the prior art, sample data is labeled manually, which is inefficient and cannot guarantee accuracy, and therefore easily introduces errors when the model is trained. An effective solution to the above problems is therefore urgently needed.
Disclosure of Invention
In view of this, the embodiments of the present specification provide a text processing method. The present specification also relates to a text processing apparatus, a computing device, and a computer-readable storage medium, which are used to solve the technical defects in the prior art.
According to a first aspect of embodiments herein, there is provided a text processing method including:
acquiring an initial text carrying a polyphone identifier, wherein the initial text comprises at least one polyphone;
determining an ith pinyin sequence corresponding to the initial text, and constructing at least one meta-phrase containing the polyphone according to the polyphone identifier and the initial text, wherein i is a positive integer starting from 1;
determining a phrase pinyin sequence of the meta-phrase according to the ith pinyin sequence, and inputting the phrase pinyin sequence to a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence;
in a case where the meta-phrase is inconsistent with the reference phrase, incrementing i by 1 and returning to the step of determining the ith pinyin sequence corresponding to the initial text;
and under the condition that the meta phrase is consistent with the reference phrase, creating a text pinyin group based on the polyphone identifier, the initial text and the ith pinyin sequence, and writing the text pinyin group into a polyphone text library.
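The steps above can be sketched as a round-trip check. The following is a minimal illustration under stated assumptions, not the patented implementation: `find_verified_pinyin`, `toy_lexicon` and `toy_generator` are hypothetical names, the candidate pinyin sequences are supplied directly rather than produced by a pinyin generation module, and the text generation module is replaced by a dictionary lookup.

```python
from typing import Callable, Iterable, List, Optional, Sequence


def find_verified_pinyin(
    text: str,
    polyphone_positions: Sequence[int],
    candidate_sequences: Iterable[Sequence[str]],
    pinyin_to_text: Callable[[Sequence[str]], str],
    window: int = 1,
) -> Optional[List[str]]:
    """Return the first candidate pinyin sequence whose meta-phrases all
    round-trip through the text generator, or None if no candidate does."""
    for pinyin_seq in candidate_sequences:  # i = 1, 2, 3, ...
        verified = True
        for pos in polyphone_positions:
            # Meta-phrase: the polyphone plus `window` neighbours on each side.
            start = max(0, pos - window)
            end = min(len(text), pos + window + 1)
            meta_phrase = text[start:end]
            phrase_pinyin = list(pinyin_seq[start:end])
            reference_phrase = pinyin_to_text(phrase_pinyin)
            if reference_phrase != meta_phrase:
                verified = False  # inconsistent: try the next sequence
                break
        if verified:
            return list(pinyin_seq)
    return None


# Toy stand-in for the text generation module (an assumption for the demo).
toy_lexicon = {("liao3", "jie3"): "了解"}


def toy_generator(phrase_pinyin: Sequence[str]) -> str:
    return toy_lexicon.get(tuple(phrase_pinyin), "")


candidates = [["le5", "jie3"], ["liao3", "jie3"]]
result = find_verified_pinyin("了解", [0], candidates, toy_generator)
```

In this toy run the first candidate fails the check because the generator cannot reproduce the meta-phrase from it, and the second candidate is accepted.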
Optionally, before the step of acquiring the initial text carrying the polyphone identifier is executed, the method further includes:
acquiring a text to be processed, and carrying out normalization processing on the text to be processed to obtain a standard text;
determining standard polyphones in the standard text based on a preset polyphone dictionary, and marking the standard polyphones;
and obtaining a standard text carrying the polyphone identifier according to the marking result, and writing the standard text carrying the polyphone identifier into a standard text library.
Optionally, the acquiring an initial text carrying a polyphone identifier includes:
in a case where an update request for updating the polyphone text library is received, extracting the initial text carrying the polyphone identifier from the standard text library based on the update request, wherein the polyphone identifier is used for marking the character position of at least one polyphone contained in the initial text.
Optionally, the determining an ith pinyin sequence corresponding to the initial text includes:
inputting the initial text into a pinyin generation module for processing to obtain the ith pinyin sequence corresponding to the initial text output by the pinyin generation module, wherein i is a positive integer starting from 1.
Optionally, constructing at least one meta-phrase containing the polyphone according to the polyphone identifier and the initial text, including:
determining a character position of the polyphone in the initial text based on the polyphone identification;
determining adjacent character positions adjacent to the character position through a preset selection strategy, and determining adjacent characters corresponding to the adjacent character positions according to the initial text;
and constructing at least one meta-phrase consisting of the adjacent characters and the polyphone according to the order in which the adjacent characters and the polyphone are arranged in the initial text.
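As a rough illustration of the construction above, the sketch below assumes the simplest possible selection strategy: a fixed number of characters to the left and right of each polyphone. The function name and the `left` and `right` parameters are hypothetical; the specification does not fix a particular strategy.

```python
from typing import List, Sequence, Tuple


def build_meta_phrases(
    text: str,
    polyphone_positions: Sequence[int],
    left: int = 1,
    right: int = 1,
) -> List[Tuple[int, str]]:
    """For each polyphone position, return (start_index, meta_phrase), where
    the meta-phrase keeps the characters in their original order."""
    phrases = []
    for pos in polyphone_positions:
        start = max(0, pos - left)               # clamp at the start of the text
        end = min(len(text), pos + right + 1)    # clamp at the end of the text
        phrases.append((start, text[start:end]))
    return phrases
```

Returning the start index alongside the phrase makes it straightforward to look up the matching slice of the pinyin sequence later.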
Optionally, the determining a phrase pinyin sequence of the meta-phrase according to the ith pinyin sequence includes:
preprocessing the initial text to obtain a plurality of initial characters, and preprocessing the meta-phrase to obtain a plurality of meta-characters;
determining the pinyin of each initial character in the plurality of initial characters according to the ith pinyin sequence;
determining a pinyin for each of the plurality of meta-characters based on the pinyin for each of the plurality of initial characters;
and generating the phrase pinyin sequence according to the pinyin of each element character in the plurality of element characters.
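The derivation of the phrase pinyin sequence can be sketched as follows, assuming that preprocessing simply splits the text into single characters and that the pinyin sequence holds exactly one syllable per character. The function name and the `start` parameter (the index of the meta-phrase in the initial text) are illustrative assumptions.

```python
from typing import List, Sequence


def phrase_pinyin_sequence(
    initial_text: str,
    text_pinyin: Sequence[str],
    meta_phrase: str,
    start: int,
) -> List[str]:
    """Map each meta-character to the pinyin of the matching initial character."""
    initial_chars = list(initial_text)   # "preprocessing": split into characters
    meta_chars = list(meta_phrase)
    if len(initial_chars) != len(text_pinyin):
        raise ValueError("expected one pinyin syllable per character")
    end = start + len(meta_chars)
    if initial_chars[start:end] != meta_chars:
        raise ValueError("meta-phrase does not occur at the given position")
    # Pair every initial character with its pinyin, then slice out the phrase.
    char_pinyin = list(zip(initial_chars, text_pinyin))
    return [pinyin for _, pinyin in char_pinyin[start:end]]
```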
Optionally, after i is incremented by 1 and the step of determining the ith pinyin sequence corresponding to the initial text is executed in the case where the meta-phrase is inconsistent with the reference phrase, the method further includes:
detecting whether the (i+1)th pinyin sequence is consistent with the ith pinyin sequence;
if not, executing the step of constructing at least one meta-phrase containing the polyphone according to the polyphone identifier and the initial text;
and if so, writing the initial text into a non-standard text library.
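The stopping rule above can be sketched as a loop that gives up once the pinyin generator starts repeating itself. This is a minimal sketch under assumptions: `generate_pinyin` and `verify` are hypothetical callables standing in for the pinyin generation module and the meta-phrase check, and `max_rounds` is an extra safety bound that is not part of the source.

```python
from typing import Callable, List, Optional


def label_or_reject(
    text: str,
    generate_pinyin: Callable[[str, int], List[str]],
    verify: Callable[[str, List[str]], bool],
    nonstandard_library: List[str],
    max_rounds: int = 10,  # assumption: safety bound, not from the source
) -> Optional[List[str]]:
    """Return a verified pinyin sequence, or None after moving the text to the
    non-standard library when no new candidate sequence is produced."""
    i = 1
    current = generate_pinyin(text, i)
    while i <= max_rounds:
        if verify(text, current):
            return current
        i += 1
        candidate = generate_pinyin(text, i)
        if candidate == current:
            # The (i+1)th sequence equals the ith one: no progress is possible.
            nonstandard_library.append(text)
            return None
        current = candidate
    nonstandard_library.append(text)
    return None
```

Comparing consecutive sequences is a cheap way to detect that the generator has exhausted its alternatives without requiring it to report how many candidates it holds.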
Optionally, the creating a text pinyin group based on the polyphone identifier, the initial text, and the ith pinyin sequence includes:
determining the pinyin position of the pinyin corresponding to the polyphone in the ith pinyin sequence based on the polyphone identifier;
extracting pinyin corresponding to the polyphone from the ith pinyin sequence according to the pinyin position;
and integrating the initial text, the polyphone identification and the pinyin corresponding to the polyphone to obtain the text pinyin group.
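One possible shape for the text pinyin group is sketched below. The `TextPinyinGroup` class and its field names are hypothetical, and the sketch assumes one pinyin syllable per character, so that a polyphone's character position is also its pinyin position.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class TextPinyinGroup:
    text: str                           # the initial text
    polyphone_positions: Tuple[int, ...]  # the polyphone identifier
    polyphone_pinyin: Tuple[str, ...]     # verified pinyin of each polyphone


def create_text_pinyin_group(
    text: str,
    polyphone_positions: Sequence[int],
    pinyin_seq: Sequence[str],
) -> TextPinyinGroup:
    """Extract the pinyin at each polyphone position and bundle it with the
    text and the identifier."""
    extracted = tuple(pinyin_seq[pos] for pos in polyphone_positions)
    return TextPinyinGroup(text, tuple(polyphone_positions), extracted)
```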
Optionally, after the steps of creating a text pinyin group based on the polyphone identifier, the initial text, and the ith pinyin sequence, and writing the text pinyin group into a polyphone text library are executed, the method further includes:
in a case where a reading request submitted for the polyphone text library is received, reading a training text pinyin group from the polyphone text library according to the reading request;
analyzing the training text pinyin group to obtain a training initial text and a training pinyin sequence;
and training an initial pinyin marking model based on the training initial text and the training pinyin sequence to obtain a target pinyin marking model.
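The parsing step can be illustrated with a small helper that turns a stored group into a training pair. The dictionary keys `text`, `positions` and `pinyin` are hypothetical, and representing the training pinyin sequence as per-character labels (with `None` for characters that are not polyphones) is an assumption made for this sketch; the training of the model itself is not shown.

```python
from typing import Dict, List, Optional, Tuple


def parse_training_group(group: Dict) -> Tuple[str, List[Optional[str]]]:
    """Split a stored group into (training text, per-character pinyin labels)."""
    text = group["text"]
    labels: List[Optional[str]] = [None] * len(text)
    for pos, pinyin in zip(group["positions"], group["pinyin"]):
        labels[pos] = pinyin  # only polyphone positions carry a label
    return text, labels
```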
Optionally, the initial text is an initial Chinese text, and each pinyin included in the ith pinyin sequence carries a tone.
According to a second aspect of embodiments herein, there is provided a text processing apparatus including:
an acquisition module configured to acquire an initial text carrying a polyphone identifier, wherein the initial text comprises at least one polyphone;
a determining module configured to determine an ith pinyin sequence corresponding to the initial text, and construct at least one meta-phrase containing the polyphone according to the polyphone identifier and the initial text, wherein i is a positive integer starting from 1;
a processing module configured to determine a phrase pinyin sequence of the meta-phrase according to the ith pinyin sequence, and input the phrase pinyin sequence to a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence;
wherein, in a case where the meta-phrase is inconsistent with the reference phrase, i is incremented by 1 and the determining module is run;
and in a case where the meta-phrase is consistent with the reference phrase, a writing module is run, wherein the writing module is configured to create a text pinyin group based on the polyphone identifier, the initial text and the ith pinyin sequence, and write the text pinyin group into a polyphone text library.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is to store computer-executable instructions, and the processor is to execute the computer-executable instructions to:
acquiring an initial text carrying a polyphone identifier, wherein the initial text comprises at least one polyphone;
determining an ith pinyin sequence corresponding to the initial text, and constructing at least one meta-phrase containing the polyphone according to the polyphone identifier and the initial text, wherein i is a positive integer starting from 1;
determining a phrase pinyin sequence of the meta-phrase according to the ith pinyin sequence, and inputting the phrase pinyin sequence to a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence;
in a case where the meta-phrase is inconsistent with the reference phrase, incrementing i by 1 and returning to the step of determining the ith pinyin sequence corresponding to the initial text;
and under the condition that the meta phrase is consistent with the reference phrase, creating a text pinyin group based on the polyphone identifier, the initial text and the ith pinyin sequence, and writing the text pinyin group into a polyphone text library.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the text processing method.
According to the text processing method provided by the specification, after an initial text containing a polyphone is acquired, a pinyin sequence of the initial text is determined, and at least one meta-phrase containing the polyphone is constructed based on the polyphone identifier carried by the initial text. A phrase pinyin sequence of the meta-phrase is then determined according to the obtained pinyin sequence, a reference phrase is generated based on the phrase pinyin sequence, and the correctness of the pinyin sequence is checked by comparing the reference phrase with the meta-phrase. If the two are inconsistent, a new pinyin sequence of the initial text is determined and the process is repeated until they are consistent, at which point the correct pinyin of the polyphone in the initial text has been determined. The pinyin sequence that passed the check, the polyphone identifier and the initial text are then integrated into a text pinyin group and written into the polyphone text library. In this way, when the polyphone in the initial text is labeled with pinyin, its correct pinyin is determined by checking, which saves manpower and material resources, effectively guarantees the accuracy of the finally created text pinyin group, and allows the polyphone text library to be constructed efficiently and quickly. As a result, downstream services that use the polyphone text library are not held back by the quality or quantity of the data in the library, which further improves their efficiency.
Drawings
Fig. 1 is a flowchart of a text processing method provided in an embodiment of the present specification;
Fig. 2 is a schematic diagram of a text processing method provided in an embodiment of the present specification;
Fig. 3 is a schematic diagram of a polyphone dictionary in a text processing method according to an embodiment of the present specification;
Fig. 4 is a schematic diagram of a text to be processed in a text processing method according to an embodiment of the present specification;
Fig. 5 is a schematic diagram of a normalization processing procedure in a text processing method according to an embodiment of the present specification;
Fig. 6 is a process flow diagram of a model training process provided by an embodiment of the present specification;
Fig. 7 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present specification;
Fig. 8 is a block diagram of a computing device according to an embodiment of the present specification.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. This description may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein, as those skilled in the art will be able to make and use the present disclosure without departing from the spirit and scope of the present disclosure.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first can also be referred to as a second and, similarly, a second can also be referred to as a first without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
First, the terms to which one or more embodiments of the present invention relate are explained.
Tone: the variation in the pitch of a sound. In modern Chinese phonetics, tone refers to the pitch contour inherent in a Chinese syllable, whose level, rise and fall can distinguish meaning. Chinese contains five tones: yinping (ˉ), yangping (ˊ), shangsheng (ˇ), qusheng (ˋ) and the neutral (light) tone. For example, the pinyin of "妈" (mom) is mā, and the corresponding tone is yinping; the pinyin of "麻" (hemp) is má, and the corresponding tone is yangping; the pinyin of "马" (horse) is mǎ, and the corresponding tone is shangsheng; the pinyin of "骂" (scold) is mà, and the corresponding tone is qusheng; and the pinyin of "吗" is ma, and the corresponding tone is the neutral tone.
In the present specification, a text processing method is provided, and the present specification relates to a text processing apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
In practical applications, a polyphone is pronounced differently in different texts, and its pinyin has to be determined from the semantics of the context. The prior art therefore generally labels the polyphones in a text manually: the correct pronunciation of each polyphone is determined by manual verification and then labeled. This process is time-consuming and labor-intensive, and the polyphones can be labeled with the correct pinyin only if the annotator has a high level of language proficiency, so neither the quality nor the quantity of the data written into the polyphone text library can be guaranteed. The low efficiency of updating or constructing the polyphone text library is therefore a problem that urgently needs to be solved.
According to the text processing method provided by the specification, after an initial text containing a polyphone is acquired, a pinyin sequence of the initial text is determined, and at least one meta-phrase containing the polyphone is constructed based on the polyphone identifier carried by the initial text. A phrase pinyin sequence of the meta-phrase is then determined according to the obtained pinyin sequence, a reference phrase is generated based on the phrase pinyin sequence, and the correctness of the pinyin sequence is checked by comparing the reference phrase with the meta-phrase. If the two are inconsistent, a new pinyin sequence of the initial text is determined and the process is repeated until they are consistent, at which point the correct pinyin of the polyphone in the initial text has been determined. The pinyin sequence that passed the check, the polyphone identifier and the initial text are then integrated into a text pinyin group and written into the polyphone text library. In this way, when the polyphone in the initial text is labeled with pinyin, its correct pinyin is determined by checking, which saves manpower and material resources, effectively guarantees the accuracy of the finally created text pinyin group, and allows the polyphone text library to be constructed efficiently and quickly. As a result, downstream services that use the polyphone text library are not held back by the quality or quantity of the data in the library, which further improves their efficiency.
Fig. 1 is a flowchart illustrating a text processing method according to an embodiment of the present specification, which specifically includes the following steps:
step S102, obtaining an initial text carrying a polyphone identifier, wherein the initial text comprises at least one polyphone.
In a specific implementation, a polyphone refers to a Chinese character with multiple pronunciations, such as "了", which is read liǎo (tone 3, shangsheng) in the phrase "了解" (understand) and le (tone 5, neutral tone) in the phrase "好了". Correspondingly, the initial text refers to a text containing at least one polyphone, and the polyphone identifier refers to an identifier that marks the position of the polyphone in the initial text. In addition, because the tone mark of a pinyin cannot be annotated directly when the pinyin is labeled, the correct pronunciation of each polyphone is expressed as pinyin combined with a digit so that it can be identified unambiguously. For example, if the correct pronunciation of "了" is liǎo, it is expressed as liao3 (indicating tone 3, shangsheng). In this embodiment, for convenience of description, pinyin whose tone is yinping is combined with the digit 1, for example the pinyin of "妈" is mā and is expressed as ma1; pinyin whose tone is yangping is combined with the digit 2, for example the pinyin of "麻" is má and is expressed as ma2; pinyin whose tone is shangsheng is combined with the digit 3, for example the pinyin of "马" is mǎ and is expressed as ma3; pinyin whose tone is qusheng is combined with the digit 4, for example the pinyin of "骂" is mà and is expressed as ma4; and pinyin whose tone is the neutral tone is combined with the digit 5, for example the pinyin of "吗" is ma and is expressed as ma5.
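To make the digit notation concrete, the following sketch converts a numbered syllable such as ma3 or liao3 back into tone-marked pinyin. It is an illustrative helper, not part of the described method, and it applies the conventional placement rule for the tone mark (on a or e if present, on the o of ou, otherwise on the last vowel). The function name is hypothetical.

```python
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}


def numbered_to_marked(syllable: str) -> str:
    """Convert e.g. 'ma3' to 'mǎ'; tone 5 (neutral) carries no mark."""
    base, tone = syllable[:-1], int(syllable[-1])
    if tone == 5:
        return base
    if not 1 <= tone <= 4:
        raise ValueError(f"unknown tone digit in {syllable!r}")
    # Decide which vowel carries the tone mark.
    if "a" in base:
        target = base.index("a")
    elif "e" in base:
        target = base.index("e")
    elif "ou" in base:
        target = base.index("o")
    else:
        vowels = [k for k, ch in enumerate(base) if ch in TONE_MARKS]
        if not vowels:
            raise ValueError(f"no vowel found in {syllable!r}")
        target = vowels[-1]
    marked = TONE_MARKS[base[target]][tone - 1]
    return base[:target] + marked + base[target + 1:]
```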
It should be noted that, in practical applications, other combinations may be chosen to express the tone of each pinyin, for example pinyin combined with symbols such as # and @; the specific form of the combined expression is not limited here.
This embodiment describes the text processing method by taking as an example the case where the initial text is an initial Chinese text that contains one polyphone. Where the initial text contains two or more polyphones, reference may be made to the corresponding description in this embodiment, and details are not repeated here.
Referring to the schematic diagram of the text processing method shown in Fig. 2, an initial text carrying polyphone identifiers is first obtained from the standard text library. The initial text is expressed as (1, 2)-×ab×…, where 1 and 2 are the polyphone identifiers indicating the positions of the polyphones in the text, counted from zero; that is, the second and third Chinese characters from left to right are polyphones, which for convenience of expression are denoted polyphone a and polyphone b, respectively. A plurality of pinyin sequences corresponding to the initial text are then determined, and a text generation module generates a reference text based on one of the pinyin sequences. Whether that pinyin sequence correctly labels the pinyin of each Chinese character in the initial text is checked by comparing the reference text with the initial text. If the two are consistent, the pinyin labeling is correct, and a text pinyin group is created directly and written into the polyphone text library. If the two are inconsistent, the next pinyin sequence is selected to generate a new reference text, and the checking process is repeated until a pinyin sequence for which the two are consistent is obtained, after which a text pinyin group is formed and written into the polyphone text library.
In practical applications, it may also happen that the reference text generated from every one of the plurality of pinyin sequences is inconsistent with the initial text. This indicates that the initial text cannot be correctly labeled with pinyin, and the initial text can be deleted from the standard text library so that useless data does not occupy storage space and waste storage resources.
Based on this, in order to obtain an initial text that contains a polyphone and carries a polyphone identifier, a large number of texts to be processed need to be collected first to construct a standard text library, which is subsequently used to build up the polyphone text library. In this embodiment, a specific implementation is as follows:
acquiring a text to be processed, and carrying out normalization processing on the text to be processed to obtain a standard text;
determining standard polyphones in the standard text based on a preset polyphone dictionary, and marking the standard polyphones;
and obtaining a standard text carrying the polyphone identifier according to the marking result, and writing the standard text carrying the polyphone identifier into a standard text library.
Specifically, the text to be processed refers to a Chinese text collected by means of big data crawling; the standard text refers to the text obtained by normalizing the text to be processed; and the polyphone dictionary refers to a dictionary that stores a large number of polyphones and their corresponding pinyins. Referring to the schematic diagram of the polyphone dictionary shown in Fig. 3, the pinyins corresponding to each polyphone are expressed in the dictionary as a mapping from the Chinese character to its pinyins; for example, the pinyins corresponding to "着" include zhao2, zhuo2, zhao1 and zhe5. It should be noted that Fig. 3 shows only a small part of the polyphone dictionary; in practical applications, the polyphones to be used and their corresponding pinyins can be written into the polyphone dictionary according to actual needs. The standard polyphones refer to the polyphones contained in the standard text.
Based on this, because the polyphone dictionary can only identify the polyphones in a text and cannot label their correct pinyin, after the texts to be processed are collected, the polyphones in the standard text can be marked using the polyphone dictionary. The standard text carrying the polyphone identifier is thereby obtained and written into the standard text library, so that when the polyphone text library is subsequently updated or constructed, an initial text that meets the requirements of pinyin labeling can be extracted from the standard text library.
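A polyphone dictionary of this kind, and the marking step built on it, might look like the sketch below. The two entries are illustrative only: the readings listed for "着" and "了" are assumptions based on standard Mandarin rather than a copy of Fig. 3, and `mark_polyphones` is a hypothetical name. Positions are counted from zero.

```python
from typing import Dict, List

# Toy polyphone dictionary: character -> candidate pinyins with tone digits.
POLYPHONE_DICT: Dict[str, List[str]] = {
    "着": ["zhao2", "zhuo2", "zhao1", "zhe5"],
    "了": ["liao3", "le5"],
}


def mark_polyphones(standard_text: str,
                    dictionary: Dict[str, List[str]] = POLYPHONE_DICT) -> List[int]:
    """Return the zero-based positions of all polyphones in the text.

    The dictionary only tells us *which* characters are polyphones; it does
    not say which reading is correct in this context."""
    return [i for i, ch in enumerate(standard_text) if ch in dictionary]
```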
In this process, not every text can be used directly for polyphone marking; see, for example, the second and third texts in Fig. 4. If polyphones were marked directly in such texts, the reference phrase and the meta-phrase could not be compared correctly in the subsequent check, and the update or construction of the polyphone text library could not be completed. Therefore, so that the texts can be used effectively in subsequent processing, after a text to be processed is obtained it can be normalized to obtain a standard text, and the polyphones are then marked in the standard text.
In the normalization process, all non-Chinese characters and symbols in the text to be processed are converted into Chinese; that is, number normalization, symbol normalization, unit normalization and translation normalization are performed. For example, the digit "1" in the text to be processed is converted into the Chinese numeral "一" (one); an English punctuation mark such as "." is converted into the corresponding Chinese punctuation mark "。"; the unit "kg" is converted into the corresponding Chinese unit; and the English word "hi" is converted into the Chinese "你好" (hello). A Chinese text that meets the labeling requirements is thereby obtained, and polyphone marking is then performed.
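The four kinds of normalization can be sketched with small lookup tables. This is a deliberately naive illustration: the tables are tiny and hypothetical, digits are converted one at a time (so multi-digit numbers and decimals are not handled properly), and "千克" is assumed as the Chinese form of "kg".

```python
import re

DIGITS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
          "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}
PUNCTUATION = {",": "，", ".": "。", "?": "？", "!": "！", ";": "；", ":": "："}
UNITS = {"kg": "千克"}    # assumed Chinese rendering of the unit
WORDS = {"hi": "你好"}    # toy translation table


def normalize(text: str) -> str:
    """Convert digits, ASCII punctuation, known units and known English words
    into Chinese characters."""
    # Replace whole words and units first, only when not embedded in a longer
    # run of Latin letters.
    for table in (WORDS, UNITS):
        for source, target in table.items():
            pattern = rf"(?<![A-Za-z]){re.escape(source)}(?![A-Za-z])"
            text = re.sub(pattern, target, text, flags=re.IGNORECASE)
    # Then map the remaining single characters.
    text = "".join(DIGITS.get(ch, ch) for ch in text)
    text = "".join(PUNCTUATION.get(ch, ch) for ch in text)
    return text
```

Running word and unit replacement before the character-level passes avoids leaving stray Latin letters behind; a production normalizer would also need proper number reading rules.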
Based on this, referring to the schematic diagram of the normalization process shown in fig. 5, after the text to be processed is collected it is normalized to obtain the corresponding standard text. The standard polyphones in the standard text are then determined based on a preset polyphone dictionary and marked, that is, the position of each polyphone in the text is recorded. The standard text carrying the polyphone identifier is obtained from the marking result and is finally written into a standard text library. A standard text library meeting the use requirement is thus created and can be used conveniently when the polyphone text library is subsequently updated or constructed.
Further, when the polyphone text library needs to be updated, the initial text carrying the polyphone identifier is extracted from the standard text library according to the received update request and is used in the subsequent text processing, so as to update a polyphone text library that meets downstream requirements. The polyphone identifier marks the character position of at least one polyphone contained in the initial text.
For example, the collected texts to be processed are as shown in fig. 4: "facing the sun and riding in a car", "what needs to be known to find the circumference of O", "0 has a reciprocal", and so on. It is determined that the second and third texts need to be normalized, giving the corresponding standard texts "what needs to be known to find the circumference of a circle" and "zero has a reciprocal". The preset polyphone dictionary is then used to mark the polyphones in each standard text, and the texts written into the standard text library as shown in fig. 5 are obtained from the marking result. For instance, the initial text carrying the polyphone identifier that corresponds to "facing the sun and riding in a car" is [(1, 2, 5) - facing the sun and riding in a car], and so on.
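A dictionary-based marking step might look like the sketch below. The sample sentence is an assumed reconstruction of the Chinese text behind the gloss "facing the sun and riding in a car", and `POLYPHONES` is a hypothetical three-entry stand-in for the preset polyphone dictionary; neither is taken from this specification.

```python
# Hypothetical miniature polyphone dictionary (characters with several readings).
POLYPHONES = {"着", "朝", "车"}


def mark_polyphones(standard_text, polyphones=POLYPHONES):
    """Return (identifier, text), where the identifier lists the zero-based
    positions of every polyphone found in the standard text."""
    identifier = tuple(
        index for index, char in enumerate(standard_text) if char in polyphones
    )
    return identifier, standard_text


# Assumed reconstruction of the example sentence.
record = mark_polyphones("迎着朝阳坐车")
```

With zero-based positions, the assumed sentence yields the identifier (1, 2, 5), the same form as the entry described for fig. 5.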
In summary, in order to facilitate subsequent updating or construction of the polyphone text library, the standard text library is constructed before text processing, collected texts to be processed are normalized and then marked with polyphones, and the marked standard text is written into the standard text library, so that subsequent updating or construction of the polyphone text library can be completed by using more standard initial texts, thereby not only ensuring the data quality of the polyphone text library, but also improving the efficiency of updating or constructing the polyphone text library and further improving the efficiency of completing downstream services.
Step S104: determining an ith pinyin sequence corresponding to the initial text, and constructing at least one meta-phrase containing the polyphone according to the polyphone identifier and the initial text, where i is a positive integer starting from 1.
Specifically, once the initial text carrying the polyphone identifier has been obtained, an ith pinyin sequence corresponding to the initial text is generated. The ith pinyin sequence is a pinyin sequence generated according to one pronunciation of the polyphone in the initial text, and the subsequent verification is carried out on this sequence. When the verification result meets the condition for creating a text pinyin group, the ith pinyin sequence can be written into the polyphone text library. When it does not, the next pinyin sequence is generated according to the next pronunciation of the polyphone, and verification is repeated until the creation condition is met or no further pronunciation is available to produce a new pinyin sequence, at which point the process ends. It should be noted that the pinyin contained in the ith pinyin sequence carries tones.
Based on this, after the ith pinyin sequence is generated for the initial text (where i is a positive integer starting from 1, and the maximum value of i is the number of pronunciations of the polyphone), it is necessary to check whether the generated ith pinyin sequence is correct, that is, whether the pinyin assigned to the polyphone in the ith pinyin sequence is the correct pronunciation of the polyphone in the initial text. To improve the checking accuracy, at least one meta-phrase containing the polyphone may be constructed according to the polyphone identifier and the initial text. The meta-phrase is later compared with a reference phrase to check the pinyin of the polyphone. A meta-phrase is a phrase that contains the polyphone and whose characters all appear in the initial text adjacent to the polyphone, so that the meta-phrase is semantically coherent and preserves the intended meaning, which further improves the accuracy of subsequent verification.
Further, the ith pinyin sequence corresponding to the initial text may be determined with a preset pinyin generation module, which may be a pinyin generation model or a pinyin generation tool (for example, one that generates the pinyin of each Chinese character by querying a dictionary). The initial text is input to the pinyin generation module, and the ith pinyin sequence output by the module is obtained. It should be noted that the module may output one or more pinyin sequences, in which case one of them is selected as the ith pinyin sequence corresponding to the initial text.
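A dictionary-lookup pinyin generation tool of the kind mentioned above could be sketched as follows. The `READINGS` table is hypothetical: its entries are copied from the two pinyin sequences used in this embodiment's running example, and the sample sentence is an assumed reconstruction of that example's Chinese text. The sketch varies only the polyphone under test and uses the first listed reading for every other character, which is a simplifying assumption.

```python
# Hypothetical per-character readings; the first reading is the default.
READINGS = {
    "迎": ["ying2"], "着": ["zhao2", "zhe5"], "朝": ["chao1"],
    "阳": ["yang2"], "坐": ["zuo4"], "车": ["che1"],
}


def candidate_sequences(text, polyphone_pos, readings=READINGS):
    """Yield one toned pinyin sequence per reading of the polyphone under test.

    The i-th yielded sequence plays the role of the ith pinyin sequence.
    """
    for reading in readings[text[polyphone_pos]]:
        yield [
            reading if index == polyphone_pos else readings[char][0]
            for index, char in enumerate(text)
        ]


sequences = list(candidate_sequences("迎着朝阳坐车", 1))
```

Here `sequences[0]` corresponds to i = 1 and `sequences[1]` to i = 2, so the maximum value of i equals the number of readings listed for the polyphone.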
Furthermore, after the ith pinyin sequence is determined, at this time, at least one meta-phrase including the polyphone may be constructed by combining the polyphone identifier and the initial text, so as to improve the accuracy of checking the pinyin of the polyphone, in this embodiment, the specific implementation manner is as follows:
determining a character position of the polyphone in the initial text based on the polyphone identification;
determining adjacent character positions adjacent to the character position through a preset selection strategy, and determining the adjacent characters corresponding to the adjacent character positions according to the initial text;
and constructing at least one meta-phrase consisting of the adjacent characters and the polyphone according to the order in which the adjacent characters and the polyphone appear in the initial text.
Specifically, the character position specifically refers to a position of the polyphone in the initial text, the selection policy specifically refers to a rule for generating the meta phrase, the adjacent character position specifically refers to a position corresponding to adjacent characters before and after the polyphone, and the adjacent characters are adjacent characters of the polyphone.
Based on this, the character position of the polyphone in the initial text is first determined according to the polyphone identifier. The adjacent character positions are then determined according to the preset selection strategy; for example, the positions of the 5 characters before and after the polyphone may be selected as the adjacent character positions. The adjacent characters of the polyphone are determined in the initial text according to these positions, and finally at least one meta-phrase consisting of the adjacent characters and the polyphone is constructed according to the order in which they appear in the initial text.
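One possible selection strategy is a set of symmetric windows of growing radius around the polyphone. The sketch below assumes that strategy; the function name, the `max_radius` parameter and the sample sentence (an assumed reconstruction of the running example) are illustrative only.

```python
def build_meta_phrases(text, polyphone_pos, max_radius=4):
    """Return context windows of radius 1..max_radius around the polyphone,
    keeping the characters in their original order."""
    phrases = []
    for radius in range(1, max_radius + 1):
        # Context that falls outside the text is simply dropped.
        start = max(0, polyphone_pos - radius)
        end = min(len(text), polyphone_pos + radius + 1)
        phrase = text[start:end]
        if phrase not in phrases:  # skip windows that have stopped growing
            phrases.append(phrase)
    return phrases


meta_phrases = build_meta_phrases("迎着朝阳坐车", 1)
```

For a polyphone in the second character position of a six-character text, this produces four meta-phrases of three, four, five and six characters, because only one character is available on the left.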
It should be noted that, because the pronunciation of a polyphone may differ from phrase to phrase, multiple meta-phrases may be created in order to assess the correctness of the current ith pinyin sequence accurately, and each meta-phrase is then checked one by one. As long as any one meta-phrase is consistent with its reference phrase, the ith pinyin sequence may be considered correct, that is, the pinyin of the polyphone is correct, and subsequent text processing may be performed. Alternatively, the ith pinyin sequence may be verified by proportion: if the proportion of meta-phrases that are consistent with their reference phrases is higher than a certain threshold, the ith pinyin sequence may be considered correct. In practical application, the specific verification policy may be set according to actual requirements, and this embodiment is not limited in this respect. The consistency between a meta-phrase and a reference phrase mentioned in this embodiment may mean that the whole phrase is consistent with the reference phrase, or that only part of it is consistent.
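The two verification policies can be expressed compactly. In this sketch the policy names, the default threshold of 0.5 and the choice to treat an empty result list as a failure are assumptions made for illustration.

```python
def sequence_is_correct(matches, policy="any", threshold=0.5):
    """Decide whether a pinyin sequence passes verification.

    matches: list of booleans, one per meta-phrase / reference-phrase comparison.
    policy:  "any"   -> a single consistent meta-phrase is enough;
             "ratio" -> the share of consistent meta-phrases must exceed threshold.
    """
    if not matches:
        return False
    if policy == "any":
        return any(matches)
    if policy == "ratio":
        return sum(matches) / len(matches) > threshold
    raise ValueError(f"unknown policy: {policy}")
```

The "any" policy accepts more sequences, while the "ratio" policy trades recall for fewer false acceptances when the text generation module is noisy.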
Following the above example, when the initial text [(1, 2, 5) - facing the sun and riding in a car] extracted from the standard text library is obtained, it can be determined from the polyphone identifier carried in the initial text that the characters at positions 1, 2 and 5 are polyphones. In this example, the processing and checking of the polyphone at position 1 is described; the other polyphones are processed in the same way, which is not repeated here.
Based on this, the initial text can be taken as [(1) - facing the sun and riding in a car]. The initial text is input to the pinyin generation module for processing, and a plurality of pinyin sequences are obtained: a first pinyin sequence {ying2-zhao2-chao1-yang2-zuo4-che1} and a second pinyin sequence {ying2-zhe5-chao1-yang2-zuo4-che1}. The first pinyin sequence is selected first to check the pinyin of the polyphone at position 1. According to the polyphone identifier, this polyphone is the second character of the initial text. One character before and one after the polyphone are selected to form a first meta-phrase (the first to third characters of the initial text). Two characters before and after are selected to form a second meta-phrase (the first to fourth characters; because only one character precedes the polyphone in the initial text, when two or more preceding characters are selected the missing ones are replaced by the empty set). Three characters before and after form a third meta-phrase (the first to fifth characters), and four characters before and after form a fourth meta-phrase (the first to sixth characters, that is, the whole initial text). These meta-phrases are used for subsequent verification of the pinyin of the polyphone.
In conclusion, in order to accurately verify the pinyin of a plurality of polyphones, at least one meta-phrase is generated by combining the initial text and used in the subsequent analysis and processing process, so that the verification accuracy can be ensured, the constructed meta-phrase can not deviate from the meaning expressed by the initial text, and the verification accuracy is further improved.
Step S106: determining a phrase pinyin sequence of the meta-phrase according to the ith pinyin sequence, and inputting the phrase pinyin sequence to a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence.
Specifically, once the ith pinyin sequence and the meta-phrase corresponding to the initial text have been determined, a reference phrase for checking the pinyin of the polyphone is created according to the ith pinyin sequence. The reference phrase is the phrase with which the meta-phrase is compared: if the character at the polyphone's position in the reference phrase is the same as in the meta-phrase, the pinyin of the polyphone in the ith pinyin sequence is determined to be correct; otherwise it is determined to be incorrect. In other words, the reference phrase is the standard against which the pinyin of the polyphone is judged. Before the reference phrase is generated, the phrase pinyin sequence of the meta-phrase needs to be determined; the reference phrase is then generated from the phrase pinyin sequence, so that the meta-phrase and the reference phrase can be compared.
On this basis, because the pinyin of the polyphone in the meta-phrase needs to be checked, a reference phrase is generated from the phrase pinyin sequence of the meta-phrase and compared with the meta-phrase. If the two phrases are consistent, the pinyin of the polyphone in the phrase pinyin sequence is correct, and therefore the ith pinyin sequence of the initial text is correct and can be used for subsequently generating the text pinyin group.
In this process, the pinyin of the polyphone in the initial text can be verified only if the reference phrase that the text generation module generates from the phrase pinyin sequence is reliable. The text generation module provided by this embodiment is therefore built on a cloud input method. Because a cloud input method relies on cloud computing technology, it can make effective use of the large storage and computing capacity of a server, which improves the precision of generating the reference phrase from the phrase pinyin sequence and allows the ith pinyin sequence to be verified accurately.
In addition, in practical application, the text generation module can also be implemented by using a text processing model in the field of machine learning, and it should be noted that the text generation module can be applied only under the condition that the accuracy of generating the reference phrase by the phrase pinyin sequence is ensured, so as to meet the requirement of accurately checking the pinyin correctness of polyphones in the initial text.
Further, in the process of generating the phrase pinyin sequence, since the meta-phrase is constructed based on the characters in the initial text, the phrase pinyin sequence of the meta-phrase may be determined according to the ith pinyin sequence, and in this embodiment, the specific implementation manner is as follows:
preprocessing the initial text to obtain a plurality of initial characters, and preprocessing the meta-phrase to obtain a plurality of meta-characters;
determining the pinyin of each initial character in the plurality of initial characters according to the ith pinyin sequence;
determining a pinyin for each of the plurality of meta-characters based on the pinyin for each of the plurality of initial characters;
and generating the phrase pinyin sequence according to the pinyin of each element character in the plurality of element characters.
Specifically, the preprocessing refers to splitting the initial text and the meta-phrase into individual characters. First, the initial text is split to obtain a plurality of initial characters, and the meta-phrase is split to obtain a plurality of meta-characters. Second, the pinyin of each initial character is determined according to the ith pinyin sequence. Third, the pinyin of each meta-character is determined based on the pinyin of the initial characters. Finally, the phrase pinyin sequence is generated from the pinyin of each meta-character.
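Because every meta-phrase is a contiguous part of the initial text, the four steps above reduce to slicing the sentence-level pinyin sequence. The sketch below assumes that the meta-phrase occurs only once in the initial text (it uses the first occurrence); the function name and the sample sentence, an assumed reconstruction of the running example, are illustrative.

```python
def phrase_pinyin_sequence(initial_text, pinyin_sequence, meta_phrase):
    """Derive the pinyin of a meta-phrase from the sentence-level sequence."""
    initial_chars = list(initial_text)   # split the initial text into characters
    meta_chars = list(meta_phrase)       # split the meta-phrase into characters
    if len(initial_chars) != len(pinyin_sequence):
        raise ValueError("pinyin sequence must have one entry per character")
    # Assumption: the first occurrence is the window the phrase was cut from.
    start = initial_text.find(meta_phrase)
    if start < 0:
        raise ValueError("meta-phrase must be a contiguous part of the text")
    # Each meta-character inherits the pinyin of the initial character
    # at the same position.
    return pinyin_sequence[start:start + len(meta_chars)]


first_sequence = ["ying2", "zhao2", "chao1", "yang2", "zuo4", "che1"]
first_phrase_pinyin = phrase_pinyin_sequence("迎着朝阳坐车", first_sequence, "迎着朝")
```

Working by position rather than through a character-to-pinyin map avoids assigning the wrong reading when the same character appears twice in a text with different pronunciations.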
Following the above example, when the first pinyin sequence {ying2-zhao2-chao1-yang2-zuo4-che1} and the first, second, third and fourth meta-phrases are obtained, the initial text "facing the sun and riding in a car" is first split into its six initial characters. The first meta-phrase is split into three meta-characters, the second meta-phrase into four meta-characters, the third meta-phrase into five meta-characters, and the fourth meta-phrase into six meta-characters.
Then the pinyin of each initial character is determined according to the first pinyin sequence {ying2-zhao2-chao1-yang2-zuo4-che1}: the first character is "ying2", the second "zhao2", the third "chao1", the fourth "yang2", the fifth "zuo4" and the sixth "che1". The pinyin of each meta-character in each meta-phrase is determined from the pinyin of the initial characters: the meta-characters of the first meta-phrase are (ying2, zhao2, chao1); those of the second meta-phrase are (ying2, zhao2, chao1, yang2); those of the third meta-phrase are (ying2, zhao2, chao1, yang2, zuo4); and those of the fourth meta-phrase are (ying2, zhao2, chao1, yang2, zuo4, che1).
Finally, based on the pinyin of each meta-character, the first phrase pinyin sequence of the first meta-phrase is determined to be {ying2-zhao2-chao1}, the second phrase pinyin sequence of the second meta-phrase {ying2-zhao2-chao1-yang2}, the third phrase pinyin sequence of the third meta-phrase {ying2-zhao2-chao1-yang2-zuo4}, and the fourth phrase pinyin sequence of the fourth meta-phrase {ying2-zhao2-chao1-yang2-zuo4-che1}. Each phrase pinyin sequence is then input to the cloud input method to generate a reference phrase. It should be noted that the tone marks are removed before input, so as to meet the input conditions of the cloud input method. The first reference phrase corresponding to the first phrase pinyin sequence is "mapping super", the second reference phrase corresponding to the second phrase pinyin sequence is "mapping super-young", the third reference phrase corresponding to the third phrase pinyin sequence is "mapping super-young to do", and the fourth reference phrase corresponding to the fourth phrase pinyin sequence is "mapping super-young sitting car" (each being a homophonic Chinese phrase returned by the input method). After the reference phrases corresponding to the phrase pinyin sequences are obtained, the subsequent verification is carried out.
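Removing the tone marks and querying a text generation module might be sketched as below. `fake_input_method` is a hypothetical stand-in for a cloud input method, backed by a one-entry lookup table invented for illustration; how syllables are actually passed to a real input method is not specified here and is left to the caller.

```python
import re


def strip_tones(phrase_pinyin):
    """Remove the trailing tone digit from every syllable."""
    return [re.sub(r"[0-5]$", "", syllable) for syllable in phrase_pinyin]


def reference_phrase(phrase_pinyin, text_generator):
    """Ask a pinyin-to-text generator for its best candidate phrase."""
    return text_generator(strip_tones(phrase_pinyin))


def fake_input_method(syllables):
    """Hypothetical text generation module: a fixed lookup table."""
    table = {("ying", "zhe", "chao"): "迎着朝"}
    return table.get(tuple(syllables), "")


toneless = strip_tones(["ying2", "zhao2", "chao1"])
```

Passing the generator as a parameter keeps the verification logic independent of whether the text generation module is an input method or a machine learning model.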
In summary, generating the reference phrase from the phrase pinyin sequence of the meta-phrase with the text generation module improves the accuracy of the reference phrase, allows the pinyin of the polyphone in the meta-phrase to be verified, effectively improves the verification accuracy, and speeds up the subsequent updating or construction of the polyphone text library.
Further, after the reference phrase is obtained, the reference phrase and the meta-phrase are compared. If the reference phrase is not consistent with the meta-phrase, step S108 is executed; if the reference phrase is consistent with the meta-phrase, step S110 is executed.
It should be noted that comparing the reference phrase with the meta-phrase in practice means determining whether the polyphone of the meta-phrase appears in the reference phrase and whether it occupies the same position in both phrases, which helps to decide whether the pinyin of the polyphone in the ith pinyin sequence is correct.
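The position-sensitive comparison can be sketched as a single check on the polyphone's offset within the phrase. The function name and the sample phrases are illustrative; the phrases are assumed reconstructions of the running example and of a homophonic input-method result.

```python
def polyphone_matches(meta_phrase, reference_phrase, polyphone_offset):
    """True if the polyphone of the meta-phrase appears in the reference
    phrase at the same offset; other characters are not compared."""
    if polyphone_offset >= len(reference_phrase):
        return False
    return reference_phrase[polyphone_offset] == meta_phrase[polyphone_offset]
```

Under this check a reference phrase that differs from the meta-phrase in a neighbouring character still counts as consistent, as long as the polyphone itself is reproduced in place.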
Step S108: if the meta-phrase is not consistent with the reference phrase, incrementing i by 1 and returning to step S104.
Specifically, when the meta-phrase and the reference phrase are not consistent, the polyphone in the meta-phrase differs from the corresponding character in the reference phrase, which indicates that the pinyin of the polyphone in the ith pinyin sequence is wrong. In this case i is incremented by 1, the next pinyin sequence of the initial text (the pinyin sequence generated from another pronunciation of the polyphone) is determined, and the process returns to step S104 to check the pinyin of the polyphone again.
In the above example, when the first reference phrase "mapping super", the second reference phrase "mapping super-young", the third reference phrase "mapping super-young to do" and the fourth reference phrase "mapping super-young sitting car" are obtained, the first reference phrase is compared with the first meta-phrase, the second reference phrase with the second meta-phrase, the third reference phrase with the third meta-phrase, and the fourth reference phrase with the fourth meta-phrase. All four comparisons are found to be inconsistent, so the pinyin "zhao2" generated for the polyphone in the first pinyin sequence {ying2-zhao2-chao1-yang2-zuo4-che1} is considered incorrect. The second pinyin sequence {ying2-zhe5-chao1-yang2-zuo4-che1} is then selected for verification; the processing is the same as described for the first pinyin sequence and is not repeated here.
In addition, after i is incremented by 1, it may turn out that all pinyin sequences corresponding to the initial text have already been verified, that is, the incremented value of i exceeds the number of pronunciations of the polyphone. Verification can then no longer be performed, which indicates that no correct pronunciation may be available for the initial text. In that case the initial text may be deleted from the standard text library and written into a non-standard text library for use in other business processes. In this embodiment, the specific implementation is as follows:
detecting whether the (i + 1) th pinyin sequence is consistent with the (i) th pinyin sequence;
if they are not consistent, executing the step of constructing at least one meta-phrase containing the polyphone according to the polyphone identifier and the initial text;
and if they are consistent, writing the initial text into a non-standard text library.
Specifically, the non-standard text library is a text library that temporarily stores initial texts that cannot be used. When the (i+1)th pinyin sequence is detected to be inconsistent with the ith pinyin sequence, the pinyin of the polyphone in the (i+1)th pinyin sequence has not yet been verified, and the verification process is executed again. When the (i+1)th pinyin sequence is detected to be consistent with the ith pinyin sequence, all pinyin sequences corresponding to the initial text have been verified without finding the correct pinyin of the polyphone. The initial text can then be deleted from the standard text library and either added to the non-standard text library or discarded, so that storage resources of the standard text library are released in time and not wasted.
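The iteration over i, including the termination test on two consecutive identical pinyin sequences, can be sketched as follows. `get_sequence` and `check` are hypothetical callables standing in for the pinyin generation module and for the meta-phrase verification; the sketch assumes that `get_sequence` keeps returning the last sequence once all pronunciations are used up.

```python
def verify_text(get_sequence, check):
    """Try pinyin sequences for i = 1, 2, ... until one passes verification.

    get_sequence(i): returns the ith pinyin sequence; assumed to repeat the
                     last sequence when no further pronunciation exists.
    check(sequence): True when the meta-phrases agree with their references.

    Returns (i, sequence) on success, or None when every pronunciation has
    been tried and the text belongs in the non-standard text library.
    """
    i = 1
    current = get_sequence(i)
    while True:
        if check(current):
            return i, current
        following = get_sequence(i + 1)
        if following == current:   # (i+1)th equals ith: nothing left to try
            return None
        i, current = i + 1, following


candidates = [
    ["ying2", "zhao2", "chao1", "yang2", "zuo4", "che1"],
    ["ying2", "zhe5", "chao1", "yang2", "zuo4", "che1"],
]


def example_get_sequence(i):
    return candidates[min(i, len(candidates)) - 1]
```

A caller would write the text into the polyphone text library when a tuple is returned and move it to the non-standard text library when `None` is returned.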
Step S110: when the meta-phrase is consistent with the reference phrase, creating a text pinyin group based on the polyphone identifier, the initial text and the ith pinyin sequence, and writing the text pinyin group into a polyphone text library.
Specifically, when the meta-phrase is consistent with the reference phrase, the polyphone in the meta-phrase is the same as the corresponding character in the reference phrase, which indicates that the pinyin of the polyphone in the ith pinyin sequence is correct. The text pinyin group can then be created according to the polyphone identifier, the initial text and the ith pinyin sequence and written into the polyphone text library. The text pinyin group is a combined expression including the polyphone identifier, the initial text and the pinyin of the polyphone, such as the text pinyin groups in the polyphone text library in fig. 2.
Further, the specific process of generating the text pinyin group is as follows:
determining the pinyin position of the pinyin corresponding to the polyphone in the ith pinyin sequence based on the polyphone identifier;
extracting pinyin corresponding to the polyphone from the ith pinyin sequence according to the pinyin position;
and integrating the initial text, the polyphone identification and the pinyin corresponding to the polyphone to obtain the text pinyin group.
Specifically, because the pinyin contained in the ith pinyin sequence corresponds to each character in the initial text, the pinyin position of the pinyin corresponding to the polyphone in the ith pinyin sequence can be determined through the polyphone identifier, the pinyin corresponding to the polyphone is extracted from the ith pinyin sequence according to the pinyin position, and finally the initial text, the polyphone identifier and the pinyin corresponding to the polyphone are integrated to obtain the text pinyin group.
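Since the pinyin sequence is aligned with the characters of the initial text, the three steps above amount to indexing the sequence with the positions in the polyphone identifier. The tuple layout returned below is an assumption chosen to mirror the (identifier, text, pinyin) form of a text pinyin group, and the sample sentence is an assumed reconstruction of the running example.

```python
def build_text_pinyin_group(initial_text, identifier, pinyin_sequence):
    """Combine identifier, text and the polyphones' pinyin into one record.

    identifier: tuple of zero-based polyphone positions in the initial text.
    """
    if len(initial_text) != len(pinyin_sequence):
        raise ValueError("pinyin sequence must have one entry per character")
    # The pinyin position equals the character position of each polyphone.
    polyphone_pinyin = tuple(pinyin_sequence[pos] for pos in identifier)
    return identifier, initial_text, polyphone_pinyin


second_sequence = ["ying2", "zhe5", "chao1", "yang2", "zuo4", "che1"]
group = build_text_pinyin_group("迎着朝阳坐车", (1,), second_sequence)
```

Keeping positions rather than characters in the record lets a downstream consumer locate the polyphone even when the same character occurs more than once in the text.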
In the above example, when the second pinyin sequence {ying2-zhe5-chao1-yang2-zuo4-che1} is selected for verification and the meta-phrases are all found to be consistent with the corresponding reference phrases generated from the second pinyin sequence, the pinyin "zhe5" generated for the polyphone in the second pinyin sequence is considered correct and is used, together with the polyphone identifier and the initial text, to generate the text pinyin group. That is, according to the polyphone identifier "1", the pinyin corresponding to the polyphone is determined to be at the second position of the second pinyin sequence {ying2-zhe5-chao1-yang2-zuo4-che1}; the pinyin "zhe5" is extracted from that position; and the polyphone identifier "1", the initial text "facing the sun and riding in a car" and the pinyin "zhe5" are integrated to obtain the text pinyin group {(1) - facing the sun and riding in a car - (zhe5)}, which is written into the polyphone text library for use by downstream services.
In summary, the text pinyin group is created by integrating the initial text, the polyphone identifier and the pinyin corresponding to the polyphone, so that the normalization of the text pinyin group can be ensured, and the regularity of the data of the polyphone text library can be further ensured, thereby facilitating the quick calling and use of the downstream service when the downstream service is used, and effectively improving the service completion efficiency of the downstream service.
In addition, after the updating or construction of the polyphone text library is completed, the text pinyin groups it contains can be used to improve the efficiency of downstream services. This embodiment takes model training as an example of a downstream service; for the specific implementation, refer to the processing flow chart of the model training process shown in fig. 6:
step S1102, when a reading request submitted for the polyphone text library is received, reading a training text pinyin group from the polyphone text library according to the reading request;
step S1104, analyzing the training text pinyin group to obtain a training initial text and a training pinyin sequence;
step S1106, training the initial pinyin marking model based on the training initial text and the training pinyin sequence to obtain a target pinyin marking model.
Specifically, receiving a reading request submitted for the polyphone text library indicates that the text pinyin groups in the library are needed for model training. The reading request can be parsed to determine the number of text pinyin groups to be read, and the training text pinyin groups used for training the initial pinyin marking model are read from the polyphone text library accordingly. The initial pinyin marking model converts the characters in a text into pinyin; to improve the accuracy of the conversion, it works on the basis of semantic analysis, so that the marked pinyin is the correct pinyin for the text.
Based on this, after the training text pinyin group is obtained, the training text pinyin group is analyzed, so that a training initial text, a training pinyin sequence and a polyphone identifier contained in the training text pinyin group can be obtained, finally, the training initial text is used as the input of the initial pinyin marking model, the training pinyin sequence is used as the output of the initial pinyin marking model, the initial pinyin marking model is trained, and finally, a target pinyin marking model meeting the use requirement is obtained.
In practical application, when the initial pinyin marking model is trained by training the initial text and the training pinyin sequence, whether to stop training or not can be determined by monitoring the loss function value, or whether to stop training or not can be determined by monitoring the output accuracy of the model, so as to obtain the target pinyin marking model meeting the use requirement.
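A stopping rule that combines the two signals could look like the sketch below. The parameter names and default values (`patience`, `min_delta`, `target_accuracy`) are assumptions for illustration; this embodiment does not prescribe particular thresholds.

```python
def should_stop(losses, accuracy=None, patience=3, min_delta=1e-4,
                target_accuracy=0.98):
    """Decide whether training of the pinyin marking model should stop.

    Stops when the monitored accuracy reaches the target, or when the loss
    has not improved by at least min_delta over the last `patience` epochs.
    """
    if accuracy is not None and accuracy >= target_accuracy:
        return True
    if len(losses) <= patience:
        return False                      # not enough history yet
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_recent > best_before - min_delta
```

The function would be called once per epoch with the loss history so far and, optionally, the latest accuracy measured on held-out text pinyin groups.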
In the text processing method provided by this specification, after an initial text containing a polyphone is obtained, a pinyin sequence of the initial text is determined, and at least one meta-phrase containing the polyphone is constructed based on the polyphone identifier carried by the initial text. A phrase pinyin sequence of the meta-phrase is determined according to the obtained pinyin sequence, a reference phrase is generated based on the phrase pinyin sequence, and the correctness of the pinyin sequence is checked by comparing the reference phrase with the meta-phrase. If the check result is inconsistent, a new pinyin sequence of the initial text is determined and the process is repeated until the check result is consistent, at which point the correct pinyin of the polyphone in the initial text has been determined. The pinyin sequence, the polyphone identifier and the initial text for which the check result is consistent are then integrated into a text pinyin group, which is written into the polyphone text library. In this way, when pinyin is marked for a polyphone in the initial text, the correct pinyin can be determined by checking, which saves manpower and material resources and effectively guarantees the accuracy of the text pinyin groups finally created. The polyphone text library can thus be constructed efficiently and quickly, so that downstream services using the library are not held back by problems with the quality or quantity of its data, and their completion efficiency is further improved.
Corresponding to the above method embodiment, this specification further provides a text processing apparatus embodiment. Fig. 7 shows a schematic structural diagram of a text processing apparatus provided in an embodiment of this specification. As shown in Fig. 7, the apparatus includes:
an obtaining module 702, configured to obtain an initial text carrying a polyphone identifier, where the initial text includes at least one polyphone;
a determining module 704, configured to determine an ith pinyin sequence corresponding to the initial text, and construct at least one meta-phrase containing the polyphone according to the polyphone identifier and the initial text, where i is a positive integer starting from 1;
a processing module 706, configured to determine a phrase pinyin sequence of the meta-phrase according to the ith pinyin sequence, and input the phrase pinyin sequence to a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence;
wherein, in the case that the meta-phrase is inconsistent with the reference phrase, i is incremented by 1 and the determining module 704 is run;
and in the case that the meta-phrase is consistent with the reference phrase, a writing module 708 is run, where the writing module 708 is configured to create a text pinyin group based on the polyphone identifier, the initial text, and the ith pinyin sequence, and write the text pinyin group into a polyphone text library.
In an optional embodiment, the text processing apparatus further includes:
an acquisition module, configured to acquire a text to be processed and normalize the text to be processed to obtain a standard text;
a marking module, configured to determine standard polyphones in the standard text based on a preset polyphone dictionary and mark the standard polyphones;
and a standard text library writing module, configured to obtain, according to the marking result, the standard text carrying the polyphone identifier and write the standard text carrying the polyphone identifier into the standard text library.
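The normalization and marking steps can be illustrated with a short sketch. The normalization rule used here (Unicode NFKC folding plus whitespace removal), the tiny polyphone dictionary, and the record layout are assumptions made for illustration; this section does not specify the actual normalization rules or dictionary contents.

```python
import unicodedata
from typing import Dict, List

# Hypothetical preset polyphone dictionary: character -> possible readings.
POLYPHONE_DICT: Dict[str, List[str]] = {
    "行": ["hang2", "xing2"],
    "长": ["chang2", "zhang3"],
}


def normalize(text: str) -> str:
    """Assumed normalization: fold full-width forms (NFKC) and drop whitespace."""
    folded = unicodedata.normalize("NFKC", text)
    return "".join(folded.split())


def mark_polyphones(standard_text: str, polyphone_dict: Dict[str, List[str]]) -> List[int]:
    """Return the character positions of polyphones found in the dictionary."""
    return [i for i, ch in enumerate(standard_text) if ch in polyphone_dict]


def to_standard_record(raw_text: str) -> dict:
    """Build a standard-text record carrying the polyphone identifier (positions)."""
    standard_text = normalize(raw_text)
    return {
        "text": standard_text,
        "polyphone_positions": mark_polyphones(standard_text, POLYPHONE_DICT),
    }


standard_text_library = [to_standard_record("我去　银行 ")]
```

Here the polyphone identifier is represented as a list of character positions, which matches the later description of the identifier as marking the character position of each polyphone.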
In an optional embodiment, the obtaining module 702 is further configured to:
in the case that an update request for updating the polyphone text library is received, extract, based on the update request, the initial text carrying the polyphone identifier from the standard text library, wherein the polyphone identifier marks the character position of at least one polyphone contained in the initial text.
In an optional embodiment, the determining module 704 is further configured to:
input the initial text into a pinyin generation module for processing to obtain the ith pinyin sequence corresponding to the initial text output by the pinyin generation module, where i is a positive integer starting from 1.
In an optional embodiment, the determining module 704 includes:
a character position determining unit, configured to determine a character position of the polyphone in the initial text based on the polyphone identifier;
an adjacent word determining unit, configured to determine, through a preset selection strategy, an adjacent character position adjacent to the character position, and determine an adjacent word corresponding to the adjacent character position according to the initial text;
and a meta-phrase constructing unit, configured to construct at least one meta-phrase consisting of the adjacent words and the polyphone according to the order in which the adjacent words and the polyphone appear in the initial text.
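One possible selection strategy can be sketched as below. The strategy shown (pair the polyphone with up to `window` characters on its left, and separately on its right) is an assumption chosen for illustration; the preset selection strategy itself is not detailed in this section.

```python
from typing import List, Tuple


def build_meta_phrases(text: str, position: int, window: int = 1) -> List[Tuple[int, str]]:
    """Build meta-phrases around the polyphone at `position`.

    Each result is (start_offset, phrase), with characters kept in the
    same order as in the initial text.
    """
    phrases: List[Tuple[int, str]] = []
    # Left-hand neighbour(s) followed by the polyphone.
    if position - window >= 0:
        start = position - window
        phrases.append((start, text[start:position + 1]))
    # The polyphone followed by its right-hand neighbour(s).
    if position + window < len(text):
        phrases.append((position, text[position:position + window + 1]))
    return phrases
```

For the text 我去银行 with the polyphone 行 at position 3, this yields a single meta-phrase 银行 starting at offset 2, because there is no character to the right of the polyphone.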
In an optional embodiment, the processing module 706 includes:
a preprocessing unit, configured to preprocess the initial text to obtain a plurality of initial characters, and preprocess the meta-phrase to obtain a plurality of meta-characters;
a first pinyin determining unit, configured to determine the pinyin of each of the plurality of initial characters according to the ith pinyin sequence;
a second pinyin determining unit, configured to determine the pinyin of each of the plurality of meta-characters based on the pinyin of each of the plurality of initial characters;
and a phrase pinyin sequence unit, configured to generate the phrase pinyin sequence according to the pinyin of each of the plurality of meta-characters.
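These four units amount to aligning characters with pinyin and slicing out the part that belongs to the meta-phrase. A minimal sketch follows, assuming one pinyin syllable per character and assuming the start offset of the meta-phrase in the initial text is known; the function and parameter names are hypothetical.

```python
from typing import List, Sequence


def phrase_pinyin_sequence(
    text: str,
    pinyin_seq: Sequence[str],
    phrase_start: int,
    meta_phrase: str,
) -> List[str]:
    """Derive the pinyin of a meta-phrase from the pinyin of the initial text."""
    initial_chars = list(text)               # preprocessing: split into characters
    if len(initial_chars) != len(pinyin_seq):
        raise ValueError("expected exactly one pinyin per character")
    # Pinyin of each initial character, by position.
    char_pinyin = list(zip(initial_chars, pinyin_seq))
    meta_chars = list(meta_phrase)           # preprocessing of the meta-phrase
    window = char_pinyin[phrase_start:phrase_start + len(meta_chars)]
    if [ch for ch, _ in window] != meta_chars:
        raise ValueError("meta-phrase does not occur at the given offset")
    # Pinyin of each meta-character, in order, forms the phrase pinyin sequence.
    return [py for _, py in window]
```

Looking up pinyin by position rather than by character avoids ambiguity when the same character occurs more than once in the initial text.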
In an optional embodiment, the text processing apparatus further includes:
a detection module, configured to detect whether the (i+1)th pinyin sequence is consistent with the ith pinyin sequence;
if not, the determining module 704 is run;
and if so, a non-standard text library writing module is run, where the non-standard text library writing module is configured to write the initial text into a non-standard text library.
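The detection step guards against retrying forever when the pinyin generation module keeps returning the same sequence. A small sketch, with a hypothetical function name and an in-memory list standing in for the non-standard text library:

```python
from typing import List, Sequence


def handle_regenerated_sequence(
    text: str,
    ith_sequence: Sequence[str],
    next_sequence: Sequence[str],
    non_standard_library: List[str],
) -> bool:
    """Return True if verification should continue with the new sequence."""
    if list(next_sequence) != list(ith_sequence):
        # A genuinely new candidate: go back to meta-phrase construction and checking.
        return True
    # No new candidate was produced: set the text aside instead of looping.
    non_standard_library.append(text)
    return False
```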
In an alternative embodiment, the writing module 708 includes:
a pinyin position determining unit configured to determine a pinyin position of a pinyin corresponding to the polyphone in the ith pinyin sequence based on the polyphone identifier;
the pinyin extracting unit is configured to extract pinyin corresponding to the polyphone in the ith pinyin sequence according to the pinyin position;
and the integration unit is configured to integrate the initial text, the polyphone identification and the pinyin corresponding to the polyphone to obtain the text pinyin group.
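A text pinyin group could be assembled as in the sketch below. The dictionary layout and field names are assumptions for illustration, as is the simplification that the pinyin position equals the character position (one pinyin per character).

```python
from typing import Dict, List, Sequence


def create_text_pinyin_group(
    text: str,
    polyphone_positions: Sequence[int],
    pinyin_seq: Sequence[str],
) -> Dict[str, object]:
    """Integrate the initial text, polyphone identifier, and polyphone pinyin."""
    polyphones: List[Dict[str, object]] = []
    for pos in polyphone_positions:
        polyphones.append({
            "position": pos,            # polyphone identifier (character position)
            "char": text[pos],
            "pinyin": pinyin_seq[pos],  # pinyin extracted at the matching position
        })
    return {"text": text, "polyphones": polyphones}


polyphone_text_library = [
    create_text_pinyin_group("我去银行", [3], ["wo3", "qu4", "yin2", "hang2"])
]
```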
In an optional embodiment, the text processing apparatus further includes:
a reading module, configured to, in the case that a read request submitted for the polyphone text library is received, read a training text pinyin group from the polyphone text library according to the read request;
the analysis module is configured to analyze the training text pinyin group to obtain a training initial text and a training pinyin sequence;
and the training module is configured to train the initial pinyin marking model based on the training initial text and the training pinyin sequence to obtain a target pinyin marking model.
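The parsing step can be sketched as follows, assuming (hypothetically) that each text pinyin group is stored as one JSON line with a `text` field and a `polyphones` list of `position`/`pinyin` entries. The storage format of the polyphone text library is not specified in this section, so this layout is purely illustrative.

```python
import json
from typing import List, Optional, Tuple


def parse_training_group(record_line: str) -> Tuple[str, List[Optional[str]]]:
    """Parse one stored group into (training text, per-character pinyin labels).

    Positions without a polyphone label are left as None.
    """
    group = json.loads(record_line)
    text = group["text"]
    labels: List[Optional[str]] = [None] * len(text)
    for entry in group["polyphones"]:
        labels[entry["position"]] = entry["pinyin"]
    return text, labels


line = json.dumps(
    {"text": "我去银行", "polyphones": [{"position": 3, "pinyin": "hang2"}]},
    ensure_ascii=False,
)
training_text, training_labels = parse_training_group(line)
```

The resulting pairs could then be fed to whatever sequence labelling model serves as the initial pinyin marking model.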
In an alternative embodiment, the initial text is an initial Chinese text, and the pinyin included in the ith pinyin sequence carries tones.
The text processing apparatus provided in this embodiment, after obtaining an initial text containing polyphones, determines a pinyin sequence of the initial text and constructs at least one meta-phrase containing a polyphone based on the polyphone identifier carried by the initial text. It then determines a phrase pinyin sequence of the meta-phrase from the obtained pinyin sequence and generates a reference phrase from the phrase pinyin sequence. The correctness of the pinyin sequence is checked by comparing the reference phrase with the meta-phrase: if they are inconsistent, a new pinyin sequence of the initial text is determined and the process is repeated until they are consistent, at which point the correct pinyin of the polyphone in the initial text is determined. The pinyin sequence for which the check is consistent, the polyphone identifier, and the initial text are then integrated into a text pinyin group, which is written into the polyphone text library. In this way, when a polyphone in the initial text is annotated with pinyin, its correct pinyin can be determined by verification, which saves manpower and material resources and helps ensure the accuracy of the text pinyin groups that are finally created. The polyphone text library can thus be built efficiently and quickly, so that downstream services that use the library are not held back by the quality or quantity of its data, which in turn improves the efficiency of those services.
The above is a schematic scheme of a text processing apparatus of the present embodiment. It should be noted that the technical solution of the text processing apparatus and the technical solution of the text processing method belong to the same concept, and details that are not described in detail in the technical solution of the text processing apparatus can be referred to the description of the technical solution of the text processing method.
Fig. 8 illustrates a block diagram of a computing device 800 provided in accordance with an embodiment of the present description. The components of the computing device 800 include, but are not limited to, memory 810 and a processor 820. The processor 820 is coupled to the memory 810 via a bus 830, and the database 850 is used to store data.
Computing device 800 also includes an access device 840, which enables computing device 800 to communicate via one or more networks 860. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. Access device 840 may include one or more of any type of wired or wireless network interface (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 800, as well as other components not shown in FIG. 8, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 8 is for purposes of example only and is not limiting as to the scope of the description. Those skilled in the art may add or replace other components as desired.
Computing device 800 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 800 may also be a mobile or stationary server.
Wherein, the processor 820 is configured to execute the following computer-executable instructions:
acquiring an initial text carrying a polyphone identifier, wherein the initial text comprises at least one polyphone;
determining an ith pinyin sequence corresponding to the initial text, and constructing at least one meta-phrase containing the polyphone according to the polyphone identifier and the initial text, wherein i is a value from 1 and is a positive integer;
determining a phrase pinyin sequence of the meta-phrase according to the ith pinyin sequence, and inputting the phrase pinyin sequence to a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence;
under the condition that the meta phrase is inconsistent with the reference phrase, i is increased by 1, and the step of determining the ith pinyin sequence corresponding to the initial text is executed;
and under the condition that the meta phrase is consistent with the reference phrase, creating a text pinyin group based on the polyphone identifier, the initial text and the ith pinyin sequence, and writing the text pinyin group into a polyphone text library.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the text processing method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the text processing method.
An embodiment of the present specification also provides a computer readable storage medium storing computer instructions that, when executed by a processor, are operable to:
acquiring an initial text carrying a polyphone identifier, wherein the initial text comprises at least one polyphone;
determining an ith pinyin sequence corresponding to the initial text, and constructing at least one meta-phrase containing the polyphone according to the polyphone identifier and the initial text, wherein i is a value from 1 and is a positive integer;
determining a phrase pinyin sequence of the meta-phrase according to the ith pinyin sequence, and inputting the phrase pinyin sequence to a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence;
under the condition that the meta phrase is inconsistent with the reference phrase, i is increased by 1, and the step of determining the ith pinyin sequence corresponding to the initial text is executed;
and under the condition that the meta phrase is consistent with the reference phrase, creating a text pinyin group based on the polyphone identifier, the initial text and the ith pinyin sequence, and writing the text pinyin group into a polyphone text library.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the text processing method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the text processing method.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content covered by the computer-readable medium may be expanded or restricted as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of combined acts, but those skilled in the art should understand that this specification is not limited by the described order of acts, because according to this specification some steps may be performed in other orders or simultaneously. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments, and that the acts and modules involved are not necessarily all required by this specification.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in the description of the specification. The alternative embodiments do not describe every detail exhaustively, nor do they limit the invention to the specific embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the specification and its practical application, and thereby to enable others skilled in the art to best understand and use the specification. The specification is limited only by the claims and their full scope and equivalents.

Claims (13)

1. A method of text processing, comprising:
acquiring an initial text carrying a polyphone identifier, wherein the initial text comprises at least one polyphone;
determining an ith pinyin sequence corresponding to the initial text, and constructing at least one meta-phrase containing the polyphone according to the polyphone identifier and the initial text, wherein i is a value from 1 and is a positive integer;
determining a phrase pinyin sequence of the meta-phrase according to the ith pinyin sequence, and inputting the phrase pinyin sequence to a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence;
under the condition that the meta phrase is inconsistent with the reference phrase, i is increased by 1, and the step of determining the ith pinyin sequence corresponding to the initial text is executed;
and under the condition that the meta phrase is consistent with the reference phrase, creating a text pinyin group based on the polyphone identifier, the initial text and the ith pinyin sequence, and writing the text pinyin group into a polyphone text library.
2. The method of claim 1, wherein before the step of acquiring the initial text carrying the polyphone identifier is performed, the method further comprises:
acquiring a text to be processed, and normalizing the text to be processed to obtain a standard text;
determining standard polyphones in the standard text based on a preset polyphone dictionary, and marking the standard polyphones;
and obtaining, according to the marking result, a standard text carrying the polyphone identifier, and writing the standard text carrying the polyphone identifier into a standard text library.
3. The method of claim 2, wherein the acquiring of the initial text carrying the polyphone identifier comprises:
in the case that an update request for updating the polyphone text library is received, extracting, based on the update request, the initial text carrying the polyphone identifier from the standard text library, wherein the polyphone identifier marks the character position of at least one polyphone contained in the initial text.
4. The method of claim 1, wherein the determining the ith pinyin sequence corresponding to the initial text comprises:
and inputting the initial text into a pinyin generation module for processing to obtain an ith pinyin sequence corresponding to the initial text output by the pinyin generation module, wherein i is a value from 1 and is a positive integer.
5. The method of claim 1, wherein the constructing at least one meta-phrase containing the polyphones according to the polyphone identifications and the initial text comprises:
determining a character position of the polyphone in the initial text based on the polyphone identification;
determining adjacent character positions adjacent to the character positions through a preset selection strategy, and determining adjacent words corresponding to the adjacent character positions according to the initial text;
and constructing at least one meta phrase consisting of the adjacent words and the polyphones according to the arrangement sequence of the adjacent words and the polyphones in the initial text.
6. The method of claim 1, wherein determining a phrase pinyin sequence for the meta-phrase based on the ith pinyin sequence comprises:
preprocessing the initial text to obtain a plurality of initial characters, and preprocessing the meta-phrase to obtain a plurality of meta-characters;
determining the pinyin of each initial character in the plurality of initial characters according to the ith pinyin sequence;
determining a pinyin for each of the plurality of meta-characters based on the pinyin for each of the plurality of initial characters;
and generating the phrase pinyin sequence according to the pinyin of each element character in the plurality of element characters.
7. The method according to claim 1, wherein after the step of incrementing i by 1 and determining the ith pinyin sequence corresponding to the initial text in the case that the meta-phrase is inconsistent with the reference phrase, the method further comprises:
detecting whether the (i+1)th pinyin sequence is consistent with the ith pinyin sequence;
if not, executing the step of constructing at least one meta-phrase containing the polyphone according to the polyphone identifier and the initial text;
and if so, writing the initial text into a non-standard text library.
8. The text processing method of claim 1, wherein creating a text pinyin group based on the polyphone identifier, the initial text, and the ith pinyin sequence comprises:
determining the pinyin position of the pinyin corresponding to the polyphone in the ith pinyin sequence based on the polyphone identifier;
extracting pinyin corresponding to the polyphone from the ith pinyin sequence according to the pinyin position;
and integrating the initial text, the polyphone identification and the pinyin corresponding to the polyphone to obtain the text pinyin group.
9. The method of claim 1, wherein after the steps of creating a text pinyin group based on the polyphone identifier, the initial text, and the ith pinyin sequence and writing the text pinyin group into the polyphone text library are performed, the method further comprises:
under the condition that a reading request submitted by aiming at the polyphone text library is received, reading a training text pinyin group in the polyphone text library according to the reading request;
analyzing the training text pinyin group to obtain a training initial text and a training pinyin sequence;
and training an initial pinyin marking model based on the training initial text and the training pinyin sequence to obtain a target pinyin marking model.
10. The method of claim 1, wherein the initial text is an initial Chinese text, and wherein the pinyin included in the ith pinyin sequence carries tones.
11. A text processing apparatus, comprising:
an obtaining module, configured to acquire an initial text carrying a polyphone identifier, wherein the initial text comprises at least one polyphone;
a determining module, configured to determine an ith pinyin sequence corresponding to the initial text, and construct at least one meta-phrase containing the polyphone according to the polyphone identifier and the initial text, wherein i is a positive integer starting from 1;
a processing module, configured to determine a phrase pinyin sequence of the meta-phrase according to the ith pinyin sequence, and input the phrase pinyin sequence to a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence;
wherein, in the case that the meta-phrase is inconsistent with the reference phrase, i is incremented by 1 and the determining module is run;
and in the case that the meta-phrase is consistent with the reference phrase, a writing module is run, wherein the writing module is configured to create a text pinyin group based on the polyphone identifier, the initial text, and the ith pinyin sequence, and write the text pinyin group into a polyphone text library.
12. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the following method:
acquiring an initial text carrying a polyphone identifier, wherein the initial text comprises at least one polyphone;
determining an ith pinyin sequence corresponding to the initial text, and constructing at least one meta-phrase containing the polyphone according to the polyphone identifier and the initial text, wherein i is a value from 1 and is a positive integer;
determining a phrase pinyin sequence of the meta-phrase according to the ith pinyin sequence, and inputting the phrase pinyin sequence to a text generation module for processing to obtain a reference phrase corresponding to the phrase pinyin sequence;
under the condition that the meta phrase is inconsistent with the reference phrase, i is increased by 1, and the step of determining the ith pinyin sequence corresponding to the initial text is executed;
and under the condition that the meta phrase is consistent with the reference phrase, creating a text pinyin group based on the polyphone identifier, the initial text and the ith pinyin sequence, and writing the text pinyin group into a polyphone text library.
13. A computer-readable storage medium storing computer instructions, which when executed by a processor, implement the steps of the text processing method of any one of claims 1 to 10.
CN202011133952.2A 2020-10-21 2020-10-21 Text processing method and device Active CN112257420B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011133952.2A CN112257420B (en) 2020-10-21 2020-10-21 Text processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011133952.2A CN112257420B (en) 2020-10-21 2020-10-21 Text processing method and device

Publications (2)

Publication Number Publication Date
CN112257420A true CN112257420A (en) 2021-01-22
CN112257420B CN112257420B (en) 2024-06-18

Family

ID=74264493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011133952.2A Active CN112257420B (en) 2020-10-21 2020-10-21 Text processing method and device

Country Status (1)

Country Link
CN (1) CN112257420B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000010964A (en) * 1998-06-17 2000-01-14 Toshiba Corp Chinese input conversion processor, chinese input conversion processing method and recording medium recording chinese input conversion processing program
US20050209844A1 (en) * 2004-03-16 2005-09-22 Google Inc., A Delaware Corporation Systems and methods for translating chinese pinyin to chinese characters
CN105336322A (en) * 2015-09-30 2016-02-17 百度在线网络技术(北京)有限公司 Polyphone model training method, and speech synthesis method and device
CN105404621A (en) * 2015-09-25 2016-03-16 中国科学院计算技术研究所 Method and system for blind people to read Chinese character
CN109977361A (en) * 2019-03-01 2019-07-05 广州多益网络股份有限公司 A kind of Chinese phonetic alphabet mask method, device and storage medium based on similar word
CN111667810A (en) * 2020-06-08 2020-09-15 北京有竹居网络技术有限公司 Method and device for acquiring polyphone corpus, readable medium and electronic equipment
CN111798834A (en) * 2020-07-03 2020-10-20 北京字节跳动网络技术有限公司 Method and device for identifying polyphone, readable medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张丽青;寿永熙;马志强;: "最大熵算法在汉语拼音标注中的研究与实现", 微电子学与计算机, no. 08, 5 August 2012 (2012-08-05) *

Also Published As

Publication number Publication date
CN112257420B (en) 2024-06-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant