CN113688613A - Method, device and storage medium for generating field annotation and understanding character string - Google Patents


Info

Publication number
CN113688613A
Authority
CN
China
Prior art keywords
pinyin
character
sequence
field
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010425675.6A
Other languages
Chinese (zh)
Inventor
郭立帆
徐阆平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN202010425675.6A
Publication of CN113688613A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/169 Annotation, e.g. comment data or footnotes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The embodiment of the application provides a field annotation generation method, a character string understanding method, a device and a storage medium. In the embodiment of the application, for the field names with missing field comments, pinyin word segmentation can be carried out on the field names so as to obtain pinyin sequences corresponding to the field names; by understanding the pinyin sequence, a Chinese sequence corresponding to the pinyin sequence can be generated, and further, a field annotation corresponding to the field name is generated according to the Chinese sequence. Therefore, in the embodiment of the application, the field annotation supplementing work does not depend on a manual mode any more, the generation efficiency of the field annotation can be effectively improved, and the accuracy of the field annotation can be ensured through reasonable word segmentation and accurate understanding of the field name.

Description

Method, device and storage medium for generating field annotation and understanding character string
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, device, and storage medium for field annotation generation and string understanding.
Background
As the amount of user data grows, more and more users require data standardization in order to obtain high-quality data assets. One of the key tasks in the data standardization process is completing the annotations of field names in the database.
At present, annotations for field names in a database have to be completed manually, which is inefficient and error-prone.
Disclosure of Invention
Aspects of the present application provide a field comment generation method, a character string understanding method, a device, and a storage medium, which are used to improve generation efficiency and accuracy of a field comment.
The embodiment of the application provides a field annotation generation method, which comprises the following steps:
acquiring a field name to be processed;
performing pinyin word segmentation on the field names to obtain pinyin sequences;
determining a Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full spelling and/or pinyin shorthand and Chinese;
and generating a field annotation corresponding to the field name according to the Chinese sequence.
The embodiment of the present application further provides a method for understanding a character string, including:
acquiring a character string to be understood;
performing pinyin word segmentation on the character string to be understood to obtain a pinyin sequence;
determining a Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full spelling and/or pinyin shorthand and Chinese;
and generating an understanding result corresponding to the character string to be understood according to the Chinese sequence.
The embodiment of the application also provides a computing device, which comprises a memory and a processor;
the memory is to store one or more computer instructions;
the processor is coupled with the memory for executing the one or more computer instructions for:
acquiring a field name to be processed;
performing pinyin word segmentation on the field names to obtain pinyin sequences;
determining a Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full spelling and/or pinyin shorthand and Chinese;
and generating a field annotation corresponding to the field name according to the Chinese sequence.
The embodiment of the application also provides a computing device, which comprises a memory and a processor;
the memory is to store one or more computer instructions;
the processor is coupled with the memory for executing the one or more computer instructions for:
acquiring a character string to be understood;
performing pinyin word segmentation on the character string to be understood to obtain a pinyin sequence;
determining a Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full spelling and/or pinyin shorthand and Chinese;
and generating an understanding result corresponding to the character string to be understood according to the Chinese sequence.
Embodiments of the present application also provide a computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the aforementioned field comment generation method or the aforementioned character string understanding method.
In the embodiment of the application, for the field names with missing field comments, pinyin word segmentation can be carried out on the field names so as to obtain pinyin sequences corresponding to the field names; by understanding the pinyin sequence, a Chinese sequence corresponding to the pinyin sequence can be generated, and further, a field annotation corresponding to the field name is generated according to the Chinese sequence. Therefore, in the embodiment of the application, the field annotation supplementing work does not depend on a manual mode any more, the generation efficiency of the field annotation can be effectively improved, and the accuracy of the field annotation can be ensured through reasonable word segmentation and accurate understanding of the field name.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flowchart illustrating a field annotation generation method according to an exemplary embodiment of the present application;
FIG. 2 is a logic block diagram of a field annotation generation method according to an exemplary embodiment of the present application;
FIG. 3 is a schematic flowchart of a method for understanding a character string according to another exemplary embodiment of the present application;
FIG. 4 is a schematic block diagram of a computing device according to yet another exemplary embodiment of the present application;
FIG. 5 is a schematic structural diagram of another computing device according to yet another exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, the field names in the database need to be annotated and completed in a manual mode, and the efficiency and the accuracy of the mode are low. In view of these technical problems, the embodiments of the present application provide a solution, and one of the basic ideas is: for the field names with missing field comments, pinyin word segmentation can be carried out on the field names to obtain pinyin sequences corresponding to the field names; by understanding the pinyin sequence, a Chinese sequence corresponding to the pinyin sequence can be generated, and further, a field annotation corresponding to the field name is generated according to the Chinese sequence. Therefore, in the embodiment of the application, the field annotation supplementing work does not depend on a manual mode any more, the generation efficiency of the field annotation can be effectively improved, and the accuracy of the field annotation can be ensured through reasonable word segmentation and accurate understanding of the field name.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating a field annotation generation method according to an exemplary embodiment of the present application. Fig. 2 is a logic block diagram of a field annotation generation method according to an exemplary embodiment of the present application. The field comment generation method provided by the embodiment may be executed by a field comment generation apparatus, which may be implemented as software or as a combination of software and hardware, and may be integrally provided in a computing device. As shown in fig. 1 and 2, the method includes:
Step 100, acquiring a field name to be processed;
Step 101, performing pinyin word segmentation on the field name to obtain a pinyin sequence;
Step 102, determining a Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full spelling and/or pinyin shorthand and Chinese;
Step 103, generating a field annotation corresponding to the field name according to the Chinese sequence.
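The four steps above can be pictured as a small pipeline. The following is a minimal sketch only, not the implementation of this application: the names generate_field_comment, toy_segment, toy_understand and TOY_MAP are hypothetical stand-ins for the pinyin word segmentation and pinyin understanding stages, and the Chinese rendering 用户数据 of the comment "user data" for the field name yhsj is an assumption made for illustration.

```python
from typing import Callable, List

def generate_field_comment(
    field_name: str,
    segment: Callable[[str], List[str]],
    understand: Callable[[List[str]], List[str]],
) -> str:
    """Field name -> pinyin sequence -> Chinese sequence -> field comment."""
    pinyin_sequence = segment(field_name)            # step 101
    chinese_sequence = understand(pinyin_sequence)   # step 102
    return "".join(chinese_sequence)                 # step 103: splice the characters

# Hypothetical toy stages: every letter is treated as pinyin shorthand and
# looked up in a tiny hand-written table (an assumption, not a trained model).
TOY_MAP = {"y": "用", "h": "户", "s": "数", "j": "据"}

def toy_segment(field_name: str) -> List[str]:
    return list(field_name)

def toy_understand(pinyin_sequence: List[str]) -> List[str]:
    return [TOY_MAP[group] for group in pinyin_sequence]

print(generate_field_comment("yhsj", toy_segment, toy_understand))  # 用户数据
```

Passing the two stages in as callables keeps the sketch independent of how segmentation and understanding are actually realized.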
The field comment generation method provided by this embodiment may be applied to fields in a database, a spreadsheet, or the like; this embodiment does not limit the application scenario. Taking a database as an example, the columns of a data table are in most cases referred to as fields, each of which contains information on a particular topic. Taking a spreadsheet as an example, a column in the spreadsheet may also serve as a field.
The field corresponds to a field name, which is the name of the information contained in the field. For example, in a database scenario, the field name may be the name of attribute class information, such as identification card, gender, and so on.
In practical applications, field names are usually composed of characters in order to meet the requirements of software code and the like. Moreover, different technicians have different naming habits, so field names are not written in a uniform way, which lowers their readability. Field names are therefore typically given field comments that explain their meaning. For example, for the field name yhsj, the technician may add the field comment "user data".
However, a large number of field names still lack field comments. Such field names can only be interpreted manually by technicians, and both efficiency and accuracy are low, particularly for technicians who did not take part in the original development.
In this embodiment, the field name of the missing field comment may be used as the field name to be processed. As mentioned above, in the present embodiment, the source of the field name to be processed is not limited.
In this embodiment, the field names may be subjected to pinyin word segmentation to obtain pinyin sequences.
Pinyin word segmentation means dividing the field name into character groups that are meaningful as pinyin. In this embodiment, the pinyin sequence may include at least one character group. Preferably, pinyin word segmentation divides the field name into the smallest units that are meaningful as pinyin, and each such unit is used as a character group. In this case, a single character group contains one pinyin full spelling or one pinyin shorthand.
A field name may consist entirely of pinyin full spellings, entirely of pinyin shorthand, or of a mixture of the two. In this embodiment, where the field name contains pinyin shorthand, the pinyin sequence may contain character groups consisting of a single character (i.e., the pinyin shorthand); where the field name contains pinyin full spellings, the pinyin sequence may contain character groups each holding the full spelling of one Chinese character.
In addition, the pinyin word segmentation keeps the original sequence of each character in the field name, and on the basis of the sequence, the pinyin sequence corresponding to the field name can be generated through the pinyin word segmentation.
For example, if the field name is wm, the pinyin sequence [w, m] is obtained after pinyin word segmentation; if the field name is jiaotong, the pinyin sequence [jiao, tong] is obtained; if the field name is ddan, the pinyin sequence [d, dan] is obtained.
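The three examples above can be reproduced with a simple dictionary-based segmenter. This is a hedged sketch that swaps in greedy longest-match lookup for illustration; it is not the pinyin word segmentation model of this application. SYLLABLES is a hypothetical toy subset of pinyin syllables, and segment_pinyin is a hypothetical name.

```python
# Toy subset of pinyin full spellings (an assumption; a real table is far larger).
SYLLABLES = {"jiao", "tong", "dan"}
MAX_SYLLABLE_LEN = 6  # the longest pinyin syllables, e.g. "zhuang", have 6 letters

def segment_pinyin(field_name):
    """Greedy longest match: known syllables become one group, any other
    character becomes a single-character group (treated as pinyin shorthand)."""
    groups = []
    i = 0
    while i < len(field_name):
        longest = min(MAX_SYLLABLE_LEN, len(field_name) - i)
        for size in range(longest, 1, -1):        # try the longest candidate first
            candidate = field_name[i:i + size]
            if candidate in SYLLABLES:
                groups.append(candidate)
                i += size
                break
        else:                                     # no multi-letter syllable starts here
            groups.append(field_name[i])
            i += 1
    return groups

print(segment_pinyin("wm"))        # ['w', 'm']
print(segment_pinyin("jiaotong"))  # ['jiao', 'tong']
print(segment_pinyin("ddan"))      # ['d', 'dan']
```

Greedy matching preserves the original character order, but unlike a statistical model it cannot use context to resolve ambiguous splits.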
Based on the pinyin sequence obtained by performing pinyin word segmentation on the field name, in this embodiment the Chinese sequence corresponding to the pinyin sequence can be determined according to the mapping relationship between pinyin full spelling and/or pinyin shorthand and Chinese.
It should be understood that, in the embodiment, the mapping relationship between the pinyin complete pinyin and the chinese language and the mapping relationship between the pinyin short writing and the chinese language can be preset at the same time. In the process of determining the Chinese sequence corresponding to the pinyin sequence, the mapping relation required to be based on the content actually contained in the pinyin sequence can be flexibly determined.
For example, if the pinyin sequence contains only pinyin full spellings, only the mapping relationship between pinyin full spelling and Chinese needs to be used; if the pinyin sequence contains only pinyin shorthand, only the mapping relationship between pinyin shorthand and Chinese needs to be used; if the pinyin sequence contains both, the two mapping relationships can be used at the same time.
The mapping relationship between pinyin full spelling and/or pinyin shorthand and Chinese may differ between industry fields. In this embodiment, the Chinese sequence corresponding to the pinyin sequence of the field name may be determined according to the target industry field to which the field name belongs, based on the mapping relationship between pinyin full spelling and/or pinyin shorthand and Chinese in that target industry field.
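The selection rule described above can be sketched as follows. This is an illustrative simplification: select_mappings is a hypothetical name, and the sketch assumes that a single-character group is always pinyin shorthand, which ignores single-letter full spellings such as "a" or "e".

```python
def select_mappings(pinyin_sequence):
    """Return which mapping relationships are needed for this pinyin sequence."""
    needs_full = any(len(group) > 1 for group in pinyin_sequence)
    needs_shorthand = any(len(group) == 1 for group in pinyin_sequence)
    selected = []
    if needs_full:
        selected.append("full")        # full spelling -> Chinese
    if needs_shorthand:
        selected.append("shorthand")   # shorthand -> Chinese
    return selected

print(select_mappings(["jiao", "tong"]))  # ['full']
print(select_mappings(["w", "m"]))        # ['shorthand']
print(select_mappings(["d", "dan"]))      # ['full', 'shorthand']
```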
On the basis, field comments corresponding to the field names can be generated according to the Chinese sequence. Usually, the Chinese characters in the Chinese sequence are spliced to generate the field comments corresponding to the field names.
In this embodiment, the field name may be divided into the smallest units that are meaningful as pinyin to generate the pinyin sequence of the field name; each smallest unit may then be understood to determine the Chinese character corresponding to it, yielding the Chinese sequence corresponding to the pinyin sequence, and the field annotation corresponding to the field name may be generated according to the Chinese sequence. In this way, the field name can be understood more accurately.
In the above or following embodiments, in order to implement the pinyin word segmentation on the field names, the field names may be input into the pinyin word segmentation model; in the pinyin word segmentation model, at least one string of continuous characters which are in line with the pinyin full-spelling characteristics and at least one single character which is not in line with the pinyin full-spelling characteristics in the field names are identified and respectively used as character groups to generate pinyin sequences corresponding to the field names.
In the pinyin word segmentation model, pinyin identification can be performed on the field names, so that pinyin full pinyin and pinyin shorthand contained in the field names are determined. For example, a single character that does not belong to a pinyin full pinyin may be determined to be a pinyin shorthand.
The pinyin full-spelling feature can be a probability requirement that a string of consecutive characters forms a pinyin full spelling. Conforming to the pinyin full-spelling feature means that a string of consecutive characters meets this probability requirement.
In this embodiment, in order to determine at least one string of consecutive characters that conforms to the pinyin full-spelling feature and at least one single character that does not, the probability that a character conforms to the state feature of each spelling position in a pinyin full spelling may be determined for each character in the field name, according to the context of the character and the character itself. Here, the state features of the spelling positions in a pinyin full spelling can be used as the pinyin full-spelling feature. Of course, in this embodiment the pinyin full-spelling feature is not limited to this and may be represented from other angles.
A pinyin full spelling comprises several spelling positions, each of which can be a start position, a middle position, or an end position. For example, in the pinyin jiao, the character j is at the start position, the characters i and a are at middle positions, and the character o is at the end position.
In this embodiment, the probability that each character in the field name meets the state characteristics of each spelling position under the pinyin full-spelling can be determined. The probability is influenced by the context of the character as well as the character itself.
If the probability that a character conforms to the state feature of any spelling position in a pinyin full spelling meets the preset condition, the character is determined to conform to the pinyin full-spelling feature and is assigned to the character group of the corresponding pinyin full spelling.
If the probability does not meet the preset condition for any spelling position in a pinyin full spelling, the character is determined to be a single character that does not conform to the pinyin full-spelling feature, and it forms a character group on its own.
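The decision rule in the two paragraphs above can be illustrated with a short sketch. It assumes that the preset condition is a simple probability threshold and that the per-position probabilities are already available; classify_character, the position labels and the value 0.5 are hypothetical choices made for illustration.

```python
def classify_character(position_probs, threshold=0.5):
    """Return the most likely spelling position if its probability exceeds the
    threshold; otherwise treat the character as pinyin shorthand."""
    best_position = max(position_probs, key=position_probs.get)
    if position_probs[best_position] > threshold:
        return best_position
    return "shorthand"

# Hypothetical probabilities for two characters of a field name.
print(classify_character({"start": 0.8, "middle": 0.15, "end": 0.05}))  # start
print(classify_character({"start": 0.3, "middle": 0.3, "end": 0.2}))    # shorthand
```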
In order to make the pinyin word segmentation model learn the pinyin full-spelling characteristics, in this embodiment, the pinyin word segmentation model may be trained by using a training text. Taking the pinyin full-spelling characteristics as the state characteristics of each spelling position as an example, in the model training process:
the training text can be obtained and converted into a pinyin full spelling to obtain a training sequence; marking the spelling position of the character in the training sequence; and inputting the marked training sequence into a pinyin word segmentation model so that the pinyin word segmentation model can learn the state characteristics of all spelling positions under pinyin full spelling as the pinyin full spelling characteristics.
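The labelling step described above could look roughly like the following sketch. The character-to-pinyin table TOY_PINYIN is a hypothetical two-entry stand-in for a real conversion tool, and the tag names B (start), M (middle), E (end) and S (single-letter syllable) are an assumed labelling scheme, not one named in this application.

```python
# Hypothetical miniature character -> full spelling table.
TOY_PINYIN = {"交": "jiao", "通": "tong"}

def label_positions(text):
    """Convert text to full spellings and tag every letter with its position."""
    labelled = []
    for chinese_char in text:
        syllable = TOY_PINYIN[chinese_char]
        last = len(syllable) - 1
        for index, letter in enumerate(syllable):
            if last == 0:
                tag = "S"          # syllable made of a single letter
            elif index == 0:
                tag = "B"          # start position
            elif index == last:
                tag = "E"          # end position
            else:
                tag = "M"          # middle position
            labelled.append((letter, tag))
    return labelled

print(label_positions("交"))
# [('j', 'B'), ('i', 'M'), ('a', 'M'), ('o', 'E')]
```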
The training text can be acquired from the network by means such as crawler technology. In addition, the acquired training texts can be classified by industry field, and the parameters of the pinyin word segmentation model can be trained separately for different industry fields, so that the model learns different pinyin full-spelling features in different industry fields.
In practical application, the pinyin word segmentation model can adopt an HMM model. Based on the HMM model, the labeled training sequence can be input into the HMM model, so that the HMM model can learn the model parameters of the pinyin full-spelling characteristics such as a state transition matrix.
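As an illustration of how a state transition matrix might be obtained from labelled training sequences, the sketch below uses plain frequency counting (maximum-likelihood estimation). The source does not specify the estimation procedure, so this is an assumption; estimate_transitions and the B/M/E tag names are hypothetical.

```python
from collections import Counter, defaultdict

def estimate_transitions(tag_sequences):
    """Estimate P(next tag | previous tag) by counting adjacent tag pairs."""
    counts = defaultdict(Counter)
    for tags in tag_sequences:
        for previous, following in zip(tags, tags[1:]):
            counts[previous][following] += 1
    return {
        previous: {tag: n / sum(row.values()) for tag, n in row.items()}
        for previous, row in counts.items()
    }

# Tags of the labelled training sequence "jiaotong".
matrix = estimate_transitions([list("BMMEBMME")])
print(matrix["M"])  # {'M': 0.5, 'E': 0.5}
```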
When pinyin word segmentation is performed on a field name, the HMM model can calculate the probability that a character conforms to the state feature of each spelling position according to the character itself and the hidden state of the character preceding it; that is, the probability that the character is at the start position, a middle position and/or the end position of a pinyin full spelling can be calculated from the context of the character and the character itself.
For example, if the probability that the character is at the beginning position of the pinyin full-spelling is higher than a preset probability threshold, the character can be determined to be the character at the beginning position in the pinyin full-spelling. And the next character is continuously identified, generally, the probability that the next character meets the state characteristics of the middle position or the end position is higher, and the spelling position of the next character can be determined according to the actual situation.
For another example, if the probability of the spelling position of the character is lower than the preset probability threshold, the character can be determined to be pinyin shorthand, that is, not belonging to any pinyin full spelling.
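To make the HMM-based segmentation concrete, the sketch below decodes the most probable position tags with the Viterbi algorithm and then groups the characters. Viterbi decoding is a standard HMM technique swapped in here for illustration; the source describes threshold comparisons and does not name a decoding algorithm. All probabilities are invented toy values, the S state for single-character shorthand is an assumed modelling choice, and the function names are hypothetical.

```python
import math

STATES = ["B", "M", "E", "S"]  # start, middle, end, single-character shorthand

def _log(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi(chars, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for chars (log-space Viterbi)."""
    scores = [{s: _log(start_p.get(s, 0)) + _log(emit_p[s].get(chars[0], 0))
               for s in STATES}]
    back = []
    for ch in chars[1:]:
        column, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[-1][p] + _log(trans_p[p].get(s, 0)))
            column[s] = (scores[-1][prev] + _log(trans_p[prev].get(s, 0))
                         + _log(emit_p[s].get(ch, 0)))
            pointers[s] = prev
        scores.append(column)
        back.append(pointers)
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):   # follow back-pointers to the first character
        state = pointers[state]
        path.append(state)
    return path[::-1]

def group_by_tags(chars, tags):
    """B and S open a new character group; M and E extend the current one."""
    groups = []
    for ch, tag in zip(chars, tags):
        if tag in ("B", "S") or not groups:
            groups.append(ch)
        else:
            groups[-1] += ch
    return groups

# Invented toy parameters (assumptions, not learned values).
START = {"B": 0.5, "S": 0.5}
TRANS = {"B": {"M": 0.7, "E": 0.3}, "M": {"M": 0.4, "E": 0.6},
         "E": {"B": 0.5, "S": 0.5}, "S": {"B": 0.5, "S": 0.5}}
EMIT = {"B": {"d": 0.5, "j": 0.25, "t": 0.25},
        "M": {"a": 0.4, "i": 0.3, "o": 0.2, "n": 0.1},
        "E": {"n": 0.5, "o": 0.25, "g": 0.25},
        "S": {"d": 0.5, "w": 0.25, "m": 0.25}}

tags = viterbi("ddan", START, TRANS, EMIT)
print(tags, group_by_tags("ddan", tags))  # ['S', 'B', 'M', 'E'] ['d', 'dan']
```

With these toy values the first d is tagged as shorthand and the remaining characters as one full spelling, matching the ddan example given earlier in this description.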
As mentioned above, the state feature of each spelling position in the pinyin full spelling is only an exemplary implementation form of the pinyin full spelling feature, in this embodiment, the pinyin full spelling feature can be further represented from other angles, and different angles of the training text can be labeled according to different pinyin full spelling features, so that the pinyin word segmentation model learns the pinyin full spelling feature from the training text.
In the embodiment, the pinyin full-spelling characteristics can be learned based on the pinyin word segmentation model, and the pinyin full-spelling and the pinyin shorthand contained in the field names can be distinguished based on the pinyin full-spelling characteristics, so that the field names can be divided into the minimum units with pinyin significance, the accuracy and the reasonability of pinyin word segmentation are ensured, and the understanding result of the subsequent pinyin understanding stage is more accurate.
In the above or following embodiments, the pinyin sequence includes at least one character group. In order to determine the Chinese sequence corresponding to the pinyin sequence, in this embodiment the pinyin sequence may be input into a pinyin understanding model; in the pinyin understanding model, the Chinese corresponding to each of the at least one character group in the pinyin sequence is determined based on the mapping relationship between pinyin full spelling and/or pinyin shorthand and Chinese, and a Chinese sequence is formed from the Chinese corresponding to the at least one character group.
The pinyin understanding model can learn different mapping relationships between pinyin full spelling and/or pinyin shorthand and Chinese for different industry fields. The target industry field to which the field name belongs can be input into the pinyin understanding model, and in the pinyin understanding model the Chinese corresponding to at least one character group in the pinyin sequence can be determined based on the mapping relationship between pinyin full spelling and/or pinyin shorthand and Chinese in that target industry field.
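The effect of the target industry field can be illustrated with a lookup-table sketch. A table is only a stand-in for the trained pinyin understanding model; the domain names, the table DOMAIN_MAPPINGS, the function understand and the chosen Chinese readings of the shorthand jt (交通 "traffic", 集团 "group") are all assumptions made for illustration.

```python
# Hypothetical per-domain mappings from a pinyin sequence to Chinese characters.
DOMAIN_MAPPINGS = {
    "transport": {("j", "t"): ["交", "通"], ("jiao", "tong"): ["交", "通"]},
    "enterprise": {("j", "t"): ["集", "团"]},
}

def understand(pinyin_sequence, domain):
    """Return one Chinese character per character group for the given domain."""
    return DOMAIN_MAPPINGS[domain][tuple(pinyin_sequence)]

print("".join(understand(["j", "t"], "transport")))   # 交通
print("".join(understand(["j", "t"], "enterprise")))  # 集团
```

The same shorthand resolves to different Chinese text depending on the domain, which is why the domain is supplied alongside the pinyin sequence.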
In order to enable the pinyin understanding model to learn the mapping relation between pinyin full pinyin and/or pinyin shorthand and Chinese in different industry fields, the pinyin understanding model can be trained. The training process may be:
acquiring a training text, and dividing the training text into single character sequences;
converting the single character sequence into a pinyin full-spelling sequence and a pinyin abbreviated sequence;
and training a pinyin understanding model by taking the pinyin full-pinyin sequence and the pinyin abbreviated sequence as input and the training text as output so that the pinyin understanding model learns the mapping relation between the pinyin full-pinyin, the pinyin short-hand and Chinese.
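The data preparation in the training steps above might be sketched as follows. TOY_PINYIN is a hypothetical two-entry conversion table, build_training_pairs is a hypothetical name, and taking the first letter of each full spelling as its shorthand is an assumption (other abbreviation conventions exist).

```python
# Hypothetical miniature character -> full spelling table.
TOY_PINYIN = {"用": "yong", "户": "hu"}

def build_training_pairs(training_text):
    """Return (input sequence, target characters) pairs for full spelling and shorthand."""
    characters = list(training_text)                   # single character sequence
    full_sequence = [TOY_PINYIN[c] for c in characters]
    shorthand_sequence = [spelling[0] for spelling in full_sequence]
    return [(full_sequence, characters), (shorthand_sequence, characters)]

for source, target in build_training_pairs("用户"):
    print(source, "->", target)
# ['yong', 'hu'] -> ['用', '户']
# ['y', 'h'] -> ['用', '户']
```

Each input element lines up with exactly one Chinese character, which is what lets a model trained on such pairs map every character group to a single character.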
The training texts can be acquired from the network by means such as crawler technology and can be classified by industry field, and the pinyin understanding model is trained with different training texts for different industry fields.
In this embodiment, the training text may be divided into single character sequences, which ensures that the pinyin understanding model can learn the mapping relationship between pinyin full pinyin and pinyin shorthand and Chinese single characters, and further ensures that the Chinese character corresponding to each character group in the pinyin sequence of the field names can be determined in the process of understanding the field names by using the pinyin understanding model, wherein each character group corresponds to one Chinese single character.
In practical application, the pinyin understanding model can adopt a seq2seq model. Of course, other types of algorithm models can be used for the pinyin understanding model, and the embodiment is not limited thereto.
In this embodiment, based on the pinyin understanding model, the mapping relationship between the pinyin full pinyin and the pinyin shorthand and the Chinese single character can be synchronously learned, so that the pinyin understanding model can process the field names which adopt the complete pinyin full pinyin, the complete pinyin shorthand or the mixed pinyin full pinyin and shorthand. Moreover, because the training text which is divided into single characters is used in the training process of the pinyin understanding model, the adaptability of the pinyin understanding model to the pinyin sequence can be ensured, and the accuracy of pinyin understanding is further improved.
In the above or below embodiments, the field name may contain a separator character.
In this embodiment, if the field name includes a separator character, the field name may be divided into a plurality of field segments according to the separator character; the pinyin word segmentation and pinyin understanding operations are then performed on each field segment to obtain the Chinese sequence of each field segment.
The separator characters in a field name serve in most cases as semantic separators. For example, the separator character "_" in the field name jggj_dqdm serves as a semantic separator that divides the semantics of the field name into native country and region code.
In this embodiment, the field name may be divided according to the separator characters, and an understanding result may be generated for each of the resulting field segments. On this basis, the Chinese sequences of the field segments can be spliced to generate the field annotation corresponding to the field name.
In addition, in the present embodiment, the separator characters in the field names may be retained in the field comments of the field names, or may be deleted directly and no longer appear in the field comments. This can be flexibly set according to actual requirements or user instructions, and the present embodiment does not limit this.
In the embodiment, the field names can be understood in a segmented manner, so that the field names can be understood more accurately, especially for the field names containing multiple semantics, the mutual influence among different semantics can be avoided, the multiple semantics contained in the field names are effectively ensured to obtain the most accurate understanding result, and the accuracy of the finally generated field annotations is effectively improved.
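Segment-wise understanding with optional retention of separators can be sketched as below. The set of separator characters, the function comment_with_separators, the lookup table TOY_SEGMENTS and the Chinese renderings 籍贯国家 ("native country") and 地区代码 ("region code") are assumptions made for illustration; the lookup stands in for the segmentation and understanding stages.

```python
import re

SEPARATOR = r"[_\-/]"  # assumed separator characters

def comment_with_separators(field_name, understand_segment, keep_separators=False):
    """Understand each field segment on its own, then splice the results."""
    pieces = []
    # The capturing group makes re.split return the separators as well.
    for part in re.split(f"({SEPARATOR})", field_name):
        if re.fullmatch(SEPARATOR, part):
            if keep_separators:
                pieces.append(part)
        elif part:
            pieces.append(understand_segment(part))
    return "".join(pieces)

# Hypothetical per-segment results standing in for the two model stages.
TOY_SEGMENTS = {"jggj": "籍贯国家", "dqdm": "地区代码"}

print(comment_with_separators("jggj_dqdm", TOY_SEGMENTS.get))
# 籍贯国家地区代码
print(comment_with_separators("jggj_dqdm", TOY_SEGMENTS.get, keep_separators=True))
# 籍贯国家_地区代码
```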
In the above or below embodiments, based on the field comments generated for the field names, the field comments corresponding to the field names may be supplemented into the database in which the field names are located.
Accordingly, the generated field annotation can be applied to the database, and the field annotation is added to the field name in the database.
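One possible way to write a generated comment back is to emit a SQL statement for the column. The sketch below builds a statement in the COMMENT ON COLUMN form accepted by PostgreSQL and Oracle; the source does not name a database system, so the dialect, the table name user_info and the function comment_statement are assumptions. Identifiers are not validated here and are assumed to come from a trusted schema listing.

```python
def comment_statement(table, column, comment):
    """Build a COMMENT ON COLUMN statement, doubling single quotes in the text."""
    escaped = comment.replace("'", "''")
    return f"COMMENT ON COLUMN {table}.{column} IS '{escaped}';"

print(comment_statement("user_info", "yhsj", "用户数据"))
# COMMENT ON COLUMN user_info.yhsj IS '用户数据';
```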
In this embodiment, the association relationship between the field names and the field annotations in the database may also be constructed based on the field annotations corresponding to the field names and the field annotations corresponding to other field names in the database where the field names are located.
On this basis, the association relationship between field names and field comments can serve as an intermediary when the database is accessed, ensuring that the visitor correctly understands the meaning of each field name in the database.
In practical application, the association relationship between the field names and the field comments can be configured in a related data access protocol, so that the communication parties perform data processing according to the same understanding basis.
Of course, the application of the field annotation is by no means limited to this, and in the present embodiment, the generated field annotation may also be applied to other processing items, which are not exhaustive here.
Fig. 3 is a flowchart illustrating a method for understanding a character string according to another exemplary embodiment of the present application. The character string understanding method provided by the embodiment may be executed by a character string understanding apparatus, which may be implemented as software or as a combination of software and hardware, and may be integrally provided in a computing device. As shown in fig. 3, the method includes:
step 300, acquiring a character string to be understood;
step 301, performing pinyin word segmentation on the character string to be understood to obtain a pinyin sequence;
step 302, determining a Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin shorthand and Chinese;
step 303, generating an understanding result corresponding to the character string to be understood according to the Chinese sequence.
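The flow of steps 300 to 303 can be sketched as a small pipeline. This is an illustration under stated assumptions only: the greedy longest-match segmenter and the lookup table below are toy stand-ins for the pinyin word segmentation model and the pinyin understanding model, and every name and table entry is hypothetical.

```python
from typing import Dict, List

# Hypothetical mapping from full spellings and single-letter shorthand to Chinese.
PINYIN_TO_CHINESE: Dict[str, str] = {
    "yong": "用",
    "hu": "户",
    "m": "名",
    "c": "称",
}

# Full spellings known to the toy segmenter, longest first.
FULL_SPELLINGS = sorted((k for k in PINYIN_TO_CHINESE if len(k) > 1), key=len, reverse=True)


def segment_pinyin(text: str) -> List[str]:
    """Step 301 (toy version): emit full spellings where they match, else single characters."""
    groups: List[str] = []
    i = 0
    while i < len(text):
        for syllable in FULL_SPELLINGS:
            if text.startswith(syllable, i):
                groups.append(syllable)
                i += len(syllable)
                break
        else:
            groups.append(text[i])  # a single character that is not part of a full spelling
            i += 1
    return groups


def to_chinese(groups: List[str]) -> List[str]:
    """Step 302 (toy version): map each character group to Chinese, keeping unknown groups."""
    return [PINYIN_TO_CHINESE.get(group, group) for group in groups]


def understand(text: str) -> str:
    """Steps 300 to 303: string in, understanding result out."""
    return "".join(to_chinese(segment_pinyin(text.lower())))


result = understand("yonghumc")
```

Here the string mixes two full spellings ("yong", "hu") with two single-letter abbreviations ("m", "c"), which is the mixed case the method is designed to handle.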
The character string understanding method provided by the embodiment may be applied to a scene in which a character string with an unknown meaning is understood, for example, a database scene, a spreadsheet scene, a chat scene, a periodical translation or reading scene, a search engine scene, a shopping mall scene, and the like.
The type of string to be understood may not be exactly the same for different application scenarios. The string to be understood may include one or more of a field name in a database, a string in chat content, a specialized term, and a search keyword. For example, in a database scenario, the string to be understood may be a field name, and in a spreadsheet scenario, the string may be the content in any cell. In other scenarios, the character string to be understood may also be a string of characters in a text, or a sentence of code in a code file, etc.
Accordingly, in this embodiment, the character string to be understood may be any character string with unknown meaning, and the source, specification, type, and the like of the character string to be understood are not limited in this embodiment.
The character string understanding method provided by this embodiment can restore a character string whose meaning is unknown. For example, in an IM tool, abbreviated character strings in the chat content can be restored as the chat is typed; in academic or professional journals, such as hospital journals, abbreviations of technical terms can be restored; and in a search scenario, keyword abbreviations can be restored, such as commodity keywords in a shopping mall scenario or search keywords in a search engine.
The present embodiment differs from the embodiment shown in fig. 1 in that the character string to be understood in the present embodiment is not limited to the field names in the foregoing embodiments.
Based on similar inventive concepts, for the technical details of the present embodiment, reference may be made to the related descriptions in the embodiments of the field annotation generation method. For the sake of brevity, these details are not expanded here, which should not cause a loss of the protection scope of the present application.
Only a few representative embodiments are described below by way of example.
In an alternative embodiment, the pinyin sequence includes at least one character group, and the character group includes a pinyin full pinyin or a pinyin short writing.
In an alternative embodiment, the step of performing pinyin word segmentation on the character string to be understood to obtain a pinyin sequence includes:
inputting a character string to be understood into a pinyin word segmentation model;
in the pinyin word segmentation model, at least one string of continuous characters which are in line with the pinyin full-spelling characteristics and at least one single character which is not in line with the pinyin full-spelling characteristics in the character string to be understood are identified and respectively used as character groups to generate a pinyin sequence corresponding to the character string to be understood.
In an alternative embodiment, the step of identifying at least one string of consecutive characters in the string of characters to be understood that matches the pinyin full-pinyin feature and at least one single character that does not match the pinyin full-pinyin feature includes:
aiming at each character in the character string to be understood, determining the probability that the character accords with the state characteristics of each spelling position under the pinyin full-spelling according to the context of the character and the character;
if the probability that the character meets the state characteristics of any spelling position under the pinyin full-spelling meets the preset condition, determining that the character meets the pinyin full-spelling characteristics;
and if the probability that the character meets the state characteristics of all spelling positions under the pinyin full-spelling does not meet the preset condition, determining the character as a single character which does not meet the pinyin full-spelling characteristics.
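The decision rule in the three steps above can be sketched as follows. The sketch assumes that the spelling positions are labeled B, M and E (beginning, middle and end of a full spelling) and that the preset condition is a probability threshold of 0.5. Both are assumptions made for illustration, and the probabilities are written by hand rather than produced by a model.

```python
from typing import Dict, List

THRESHOLD = 0.5  # assumed preset condition: probability of at least 0.5


def matches_full_spelling(state_probs: Dict[str, float],
                          threshold: float = THRESHOLD) -> bool:
    """True if the character fits ANY spelling position well enough."""
    return any(p >= threshold for p in state_probs.values())


def classify(chars: str, probs: List[Dict[str, float]]) -> List[str]:
    """Label each character as part of a full spelling or as a single character."""
    return ["full" if matches_full_spelling(p) else "single" for _, p in zip(chars, probs)]


# Hand-written probabilities for the characters of "hum".
example_probs = [
    {"B": 0.9, "M": 0.05, "E": 0.05},  # "h" looks like the start of a spelling
    {"B": 0.1, "M": 0.2, "E": 0.7},    # "u" looks like the end of a spelling
    {"B": 0.2, "M": 0.1, "E": 0.1},    # "m" fits no position well
]
labels = classify("hum", example_probs)
```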
In an alternative embodiment, before inputting the character string to be understood into the pinyin word segmentation model, the steps further include:
acquiring a training text, and converting the training text into a pinyin full spelling to obtain a training sequence;
marking the spelling position of the character in the training sequence;
and inputting the marked training sequence into a pinyin word segmentation model so that the pinyin word segmentation model can learn the state characteristics of all spelling positions under pinyin full spelling as the pinyin full spelling characteristics.
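Marking the spelling positions of a training sequence can be sketched as below. The B/M/E/S tag set (beginning, middle, end, and single-letter syllable) is an assumption borrowed from common sequence-labeling practice, and the input is assumed to be already converted to full-spelling syllables.

```python
from typing import List, Tuple


def tag_syllable(syllable: str) -> List[Tuple[str, str]]:
    """Tag every letter of one full-spelling syllable with its spelling position."""
    if len(syllable) == 1:
        return [(syllable, "S")]
    tags = ["B"] + ["M"] * (len(syllable) - 2) + ["E"]
    return list(zip(syllable, tags))


def tag_training_sequence(syllables: List[str]) -> List[Tuple[str, str]]:
    """Concatenate the tagged letters of all syllables into one training sequence."""
    tagged: List[Tuple[str, str]] = []
    for syllable in syllables:
        tagged.extend(tag_syllable(syllable))
    return tagged


# "用户" converted to full spelling gives the syllables "yong" and "hu".
tagged = tag_training_sequence(["yong", "hu"])
```

Counting how often each letter appears under each tag, and how often one tag follows another, would then give the statistics from which a segmentation model can learn the state characteristics.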
In an alternative embodiment, the pinyin word segmentation model employs a hidden Markov model (HMM).
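For illustration, decoding with a hidden Markov model is commonly done with the Viterbi algorithm. The sketch below is a generic Viterbi decoder with a hand-set toy model: the state set (B and E for the beginning and end of a full spelling, O for a single character outside any full spelling) and all probabilities are assumptions for the example, not parameters from the embodiment.

```python
import math
from typing import Dict, List, Sequence

Table = Dict[str, Dict[str, float]]


def viterbi(obs: Sequence[str], states: List[str], start_p: Dict[str, float],
            trans_p: Table, emit_p: Table, floor: float = 1e-9) -> List[str]:
    """Return the most likely state path for obs under the given HMM."""
    if not obs:
        return []

    def lg(p: float) -> float:
        return math.log(max(p, floor))  # floor avoids log(0) for unseen events

    # score[t][s]: best log-probability of any path that ends in state s at time t
    score = [{s: lg(start_p.get(s, 0.0)) + lg(emit_p[s].get(obs[0], 0.0)) for s in states}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: score[t - 1][r] + lg(trans_p[r].get(s, 0.0)))
            score[t][s] = (score[t - 1][prev] + lg(trans_p[prev].get(s, 0.0))
                           + lg(emit_p[s].get(obs[t], 0.0)))
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


STATES = ["B", "E", "O"]
START_P = {"B": 0.6, "O": 0.4}
TRANS_P = {"B": {"E": 1.0}, "E": {"B": 0.5, "O": 0.5}, "O": {"B": 0.5, "O": 0.5}}
EMIT_P = {"B": {"h": 0.9, "m": 0.1}, "E": {"u": 1.0}, "O": {"m": 0.8, "h": 0.1, "u": 0.1}}

path = viterbi("hum", STATES, START_P, TRANS_P, EMIT_P)
```

With these toy parameters, "hu" is decoded as one full spelling and "m" as a single character, which corresponds to the character groups ["hu", "m"].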
In an optional embodiment, the step of determining a Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin shorthand and Chinese includes:
inputting the pinyin sequence into a pinyin understanding model;
in the pinyin understanding model, determining Chinese corresponding to at least one character group in the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin short writing and the Chinese; and forming a Chinese sequence according to the Chinese corresponding to at least one character group.
In an alternative embodiment, the step of inputting the pinyin sequence into the pinyin understanding model includes:
determining a target industry field where a character string to be understood is located;
inputting a pinyin sequence and a target industry field into a pinyin understanding model;
the determining of the Chinese corresponding to at least one character group in the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin shorthand and Chinese comprises:
and determining the Chinese corresponding to at least one character group in the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin short writing and Chinese in the target industry field.
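The effect of supplying a target industry field can be sketched with a plain lookup: the same shorthand resolves to different Chinese depending on the field. In the embodiment this selection happens inside the pinyin understanding model. The tables, field labels, and fallback order below are hypothetical and serve only as an example.

```python
from typing import Dict

# Hypothetical per-field mappings for the shorthand "zy".
DOMAIN_MAPPINGS: Dict[str, Dict[str, str]] = {
    "medical": {"zy": "住院"},  # zhu yuan
    "finance": {"zy": "质押"},  # zhi ya
}

# Hypothetical general-purpose mapping used when the field has no entry.
GENERAL_MAPPING: Dict[str, str] = {"zy": "资源"}  # zi yuan


def lookup(group: str, domain: str) -> str:
    """Prefer the mapping of the target industry field, then the general one."""
    domain_map = DOMAIN_MAPPINGS.get(domain, {})
    if group in domain_map:
        return domain_map[group]
    return GENERAL_MAPPING.get(group, group)  # unknown groups are returned unchanged
```

For example, `lookup("zy", "medical")` and `lookup("zy", "finance")` return different words for the same two letters.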
In an alternative embodiment, the step, before inputting the pinyin sequence into the pinyin understanding model, further includes:
acquiring a training text, and dividing the training text into single character sequences;
converting the single character sequence into a pinyin full-spelling sequence and a pinyin abbreviated sequence;
and training a pinyin understanding model by taking the pinyin full-pinyin sequence and the pinyin abbreviated sequence as input and the training text as output so that the pinyin understanding model learns the mapping relation between the pinyin full-pinyin, the pinyin short-hand and Chinese.
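Building the training pairs described above can be sketched as follows. The character-to-pinyin table is a hypothetical stand-in for a real conversion tool, and taking the first letter of each syllable as the shorthand is an assumption about how the abbreviated sequence is formed.

```python
from typing import Dict, List, Tuple

# Hypothetical conversion table; a real system would use a full pinyin dictionary.
CHAR_TO_PINYIN: Dict[str, str] = {"用": "yong", "户": "hu", "名": "ming", "称": "cheng"}


def make_training_examples(text: str) -> List[Tuple[List[str], str]]:
    """Return (input sequence, target text) pairs for one training text."""
    chars = list(text)                                # single character sequence
    full = [CHAR_TO_PINYIN[ch] for ch in chars]       # pinyin full-spelling sequence
    short = [syllable[0] for syllable in full]        # assumed shorthand: first letters
    return [(full, text), (short, text)]


examples = make_training_examples("用户名称")
```

Each training text thus yields two examples with the same target, so that a sequence-to-sequence model sees both the full spelling and the shorthand for the same Chinese.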
In an alternative embodiment, the pinyin understanding model employs a sequence-to-sequence (seq2seq) model.
In an optional embodiment, the method further comprises:
supplementing an understanding result corresponding to the character string to be understood to a database in which the character string to be understood is located; or
constructing an association relation between the character strings to be understood and the understanding results in the database based on the understanding results corresponding to the character strings to be understood and the understanding results corresponding to other character strings to be understood in the database where the character strings to be understood are located.
In an optional embodiment, before performing pinyin word segmentation on the character string to be understood to obtain a pinyin sequence, the method further includes:
if the character string to be understood contains separator characters, dividing the character string to be understood into a plurality of character segments according to the separator characters;
for each of the plurality of character segments, respectively performing the pinyin word segmentation on the character segment to obtain a pinyin sequence, and the determination of a Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin shorthand and Chinese, so as to obtain the respective Chinese sequences of the plurality of character segments;
the generating, according to the Chinese sequence, of a field annotation corresponding to the character string to be understood comprises:
splicing the Chinese sequences of the plurality of character segments to generate the field annotation corresponding to the character string to be understood.
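The segment-then-splice procedure can be sketched as below. The set of separator characters and the `joiner` parameter are assumptions made for illustration. An empty joiner drops the separators from the result, while a non-empty one keeps a separator between the spliced Chinese sequences. The per-segment function is passed in so that any understanding routine can be plugged in.

```python
import re
from typing import Callable, Dict

SEPARATORS = r"[_\-.\s]+"  # assumed separator characters: underscore, hyphen, dot, whitespace


def understand_segmented(text: str, understand_segment: Callable[[str], str],
                         joiner: str = "") -> str:
    """Split on separators, understand each segment on its own, then splice the results."""
    segments = [segment for segment in re.split(SEPARATORS, text) if segment]
    return joiner.join(understand_segment(segment) for segment in segments)


# Hypothetical per-segment results standing in for the two models.
toy_results: Dict[str, str] = {"yonghu": "用户", "mc": "名称"}


def toy_understand(segment: str) -> str:
    return toy_results.get(segment, segment)


spliced = understand_segmented("yonghu_mc", toy_understand)
kept = understand_segmented("yonghu_mc", toy_understand, joiner="_")
```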
It should be noted that, the execution subjects of the steps of the character string understanding method provided in the above embodiments may be the same device, or different devices may also be used as the execution subjects of the method. For example, the execution subjects of steps 100 to 102 may be device a; for another example, the execution subject of steps 100 and 101 may be device a, and the execution subject of step 102 may be device B; and so on.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations are included in a specific order, but it should be clearly understood that the operations may be executed out of the order presented herein or in parallel, and the sequence numbers of the operations, such as 100, 102, etc., are merely used for distinguishing different operations, and the sequence numbers do not represent any execution order per se. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel.
Fig. 4 is a schematic structural diagram of a computing device according to another exemplary embodiment of the present application. As shown in fig. 4, the computing device includes: a memory 40 and a processor 41.
A processor 41, coupled to the memory 40, for executing the computer program in the memory 40 for:
acquiring a field name to be processed;
performing pinyin word segmentation on the field names to obtain pinyin sequences;
determining a Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin short writing and Chinese;
and generating a field annotation corresponding to the field name according to the Chinese sequence.
In an alternative embodiment, the pinyin sequence includes at least one character group, and the character group includes a pinyin full pinyin or a pinyin short writing.
In an alternative embodiment, the processor 41, when performing pinyin word segmentation on the field names to obtain pinyin sequences, is configured to:
inputting the field names into a pinyin word segmentation model;
in the pinyin word segmentation model, at least one string of continuous characters which are in line with the pinyin full-spelling characteristics and at least one single character which is not in line with the pinyin full-spelling characteristics in the field names are identified and respectively used as character groups to generate pinyin sequences corresponding to the field names.
In an alternative embodiment, the processor 41, when identifying at least one string of consecutive characters in the field name that matches the pinyin full-spelling feature and at least one single character that does not match the pinyin full-spelling feature, is configured to:
aiming at each character in the field name, determining the probability that the character accords with the state characteristics of each spelling position under the pinyin full-spelling according to the context of the character and the character;
if the probability that the character meets the state characteristics of any spelling position under the pinyin full-spelling meets the preset condition, determining that the character meets the pinyin full-spelling characteristics;
and if the probability that the character meets the state characteristics of all spelling positions under the pinyin full-spelling does not meet the preset condition, determining the character as a single character which does not meet the pinyin full-spelling characteristics.
In an alternative embodiment, the processor 41 is further configured to, before entering the field names into the pinyin word segmentation model:
acquiring a training text, and converting the training text into a pinyin full spelling to obtain a training sequence;
marking the spelling position of the character in the training sequence;
and inputting the marked training sequence into a pinyin word segmentation model so that the pinyin word segmentation model can learn the state characteristics of all spelling positions under pinyin full spelling as the pinyin full spelling characteristics.
In an alternative embodiment, the pinyin word segmentation model employs a hidden Markov model (HMM).
In an alternative embodiment, the processor 41, when determining the Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin shorthand and Chinese, is configured to:
inputting the pinyin sequence into a pinyin understanding model;
in the pinyin understanding model, determining Chinese corresponding to at least one character group in the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin short writing and the Chinese; and forming a Chinese sequence according to the Chinese corresponding to at least one character group.
In an alternative embodiment, the processor 41, when entering the pinyin sequence into the pinyin understanding model, is configured to:
determining a target industry field where the field name is located;
inputting a pinyin sequence and a target industry field into a pinyin understanding model;
and, when determining the Chinese corresponding to at least one character group in the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin shorthand and Chinese, the processor 41 is configured to:
and determining the Chinese corresponding to at least one character group in the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin short writing and Chinese in the target industry field.
In an alternative embodiment, the processor 41 is further configured to, prior to entering the pinyin sequence into the pinyin understanding model:
acquiring a training text, and dividing the training text into single character sequences;
converting the single character sequence into a pinyin full-spelling sequence and a pinyin abbreviated sequence;
and training a pinyin understanding model by taking the pinyin full-pinyin sequence and the pinyin abbreviated sequence as input and the training text as output so that the pinyin understanding model learns the mapping relation between the pinyin full-pinyin, the pinyin short-hand and Chinese.
In an alternative embodiment, the pinyin understanding model employs a sequence-to-sequence (seq2seq) model.
In an alternative embodiment, processor 41 is further configured to:
supplementing the field comments corresponding to the field names to a database where the field names are located; or
constructing the association relationship between the field names and the field annotations in the database based on the field annotations corresponding to the field names and the field annotations corresponding to other field names in the database where the field names are located.
In an alternative embodiment, the processor 41 is further configured to, before performing pinyin word segmentation on the field names to obtain pinyin sequences:
if the field name contains separator characters, dividing the field name into a plurality of character segments according to the separator characters;
for each of the plurality of character segments, respectively performing the pinyin word segmentation on the character segment to obtain a pinyin sequence, and the determination of a Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin shorthand and Chinese, so as to obtain the respective Chinese sequences of the plurality of character segments;
and, when generating a field annotation corresponding to the field name according to the Chinese sequence, the processor 41 is configured to:
splice the Chinese sequences of the plurality of character segments to generate the field annotation corresponding to the field name.
It should be noted that, for the above technical details in the embodiments of the computing device, reference may be made to the related description in the embodiments of the field annotation generation method, and for the sake of brevity, no further description is provided here, but this should not cause a loss of the scope of the present application.
Further, as shown in fig. 4, the computing device further includes: communication components 42, power components 43, and the like. Only some of the components are schematically shown in fig. 4, and the computing device is not meant to include only the components shown in fig. 4.
Accordingly, the present application further provides a computer-readable storage medium storing a computer program, where the computer program can implement the steps that can be executed by a computing device in the foregoing method embodiments when executed.
Fig. 5 is a schematic structural diagram of another computing device according to yet another embodiment of the present application. As shown in fig. 5, the computing device includes: a memory 50 and a processor 51.
A processor 51, coupled to the memory 50, for executing the computer program in the memory 50 for:
acquiring a character string to be understood;
performing pinyin word segmentation on the character string to be understood to obtain a pinyin sequence;
determining a Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin short writing and Chinese;
and generating an understanding result corresponding to the character string to be understood according to the Chinese sequence.
In an alternative embodiment, the pinyin sequence includes at least one character group, and the character group includes a pinyin full pinyin or a pinyin short writing.
In an alternative embodiment, the processor 51, when performing pinyin word segmentation on the string to be understood to obtain a pinyin sequence, is configured to:
inputting a character string to be understood into a pinyin word segmentation model;
in the pinyin word segmentation model, at least one string of continuous characters which are in line with the pinyin full-spelling characteristics and at least one single character which is not in line with the pinyin full-spelling characteristics in the character string to be understood are identified and respectively used as character groups to generate a pinyin sequence corresponding to the character string to be understood.
In an alternative embodiment, the processor 51, when identifying at least one string of consecutive characters in the string of characters to be understood that matches the pinyin full-pinyin feature and at least one single character that does not match the pinyin full-pinyin feature, is configured to:
aiming at each character in the character string to be understood, determining the probability that the character accords with the state characteristics of each spelling position under the pinyin full-spelling according to the context of the character and the character;
if the probability that the character meets the state characteristics of any spelling position under the pinyin full-spelling meets the preset condition, determining that the character meets the pinyin full-spelling characteristics;
and if the probability that the character meets the state characteristics of all spelling positions under the pinyin full-spelling does not meet the preset condition, determining the character as a single character which does not meet the pinyin full-spelling characteristics.
In an alternative embodiment, the processor 51 is further configured to, before inputting the character string to be understood into the pinyin word segmentation model:
acquiring a training text, and converting the training text into a pinyin full spelling to obtain a training sequence;
marking the spelling position of the character in the training sequence;
and inputting the marked training sequence into a pinyin word segmentation model so that the pinyin word segmentation model can learn the state characteristics of all spelling positions under pinyin full spelling as the pinyin full spelling characteristics.
In an alternative embodiment, the pinyin word segmentation model employs a hidden Markov model (HMM).
In an alternative embodiment, the processor 51, when determining a Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin shorthand and Chinese, is configured to:
inputting the pinyin sequence into a pinyin understanding model;
in the pinyin understanding model, determining Chinese corresponding to at least one character group in the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin short writing and the Chinese; and forming a Chinese sequence according to the Chinese corresponding to at least one character group.
In an alternative embodiment, the processor 51, when entering the pinyin sequence into the pinyin understanding model, is configured to:
determining a target industry field where a character string to be understood is located;
inputting a pinyin sequence and a target industry field into a pinyin understanding model;
and, when determining the Chinese corresponding to at least one character group in the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin shorthand and Chinese, the processor 51 is configured to:
and determining the Chinese corresponding to at least one character group in the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin short writing and Chinese in the target industry field.
In an alternative embodiment, the processor 51 is further configured to, prior to entering the pinyin sequence into the pinyin understanding model:
acquiring a training text, and dividing the training text into single character sequences;
converting the single character sequence into a pinyin full-spelling sequence and a pinyin abbreviated sequence;
and training a pinyin understanding model by taking the pinyin full-pinyin sequence and the pinyin abbreviated sequence as input and the training text as output so that the pinyin understanding model learns the mapping relation between the pinyin full-pinyin, the pinyin short-hand and Chinese.
In an alternative embodiment, the pinyin understanding model employs a sequence-to-sequence (seq2seq) model.
In an optional embodiment, the processor 51 is further configured to:
supplementing an understanding result corresponding to the character string to be understood to a database in which the character string to be understood is located; or
constructing an association relation between the character strings to be understood and the understanding results in the database based on the understanding results corresponding to the character strings to be understood and the understanding results corresponding to other character strings to be understood in the database where the character strings to be understood are located.
In an alternative embodiment, the processor 51 is further configured to, before performing pinyin word segmentation on the string to be understood to obtain the pinyin sequence:
if the character string to be understood contains separator characters, dividing the character string to be understood into a plurality of character segments according to the separator characters;
for each of the plurality of character segments, respectively performing the pinyin word segmentation on the character segment to obtain a pinyin sequence, and the determination of a Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin shorthand and Chinese, so as to obtain the respective Chinese sequences of the plurality of character segments;
and, when generating a field annotation corresponding to the character string to be understood according to the Chinese sequence, the processor 51 is configured to:
splice the Chinese sequences of the plurality of character segments to generate the field annotation corresponding to the character string to be understood.
It should be noted that, for the technical details in the above embodiments of the computing device, reference may be made to the related descriptions in the embodiments of the character string understanding method. For the sake of brevity, they are not repeated here, but this should not cause a loss of the protection scope of the present application.
Further, as shown in fig. 5, the computing device further includes: communication components 52, power components 53, and the like. Only some of the components are schematically shown in fig. 5, and the computing device is not meant to include only the components shown in fig. 5.
Accordingly, the present application further provides a computer-readable storage medium storing a computer program, where the computer program can implement the steps that can be executed by a computing device in the foregoing method embodiments when executed.
The memories of fig. 4 and 5 are used, among other things, to store computer programs and may be configured to store various other data to support operations on the computing platform. Examples of such data include instructions for any application or method operating on the computing platform, contact data, phonebook data, messages, pictures, videos, and so forth. The memory may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The communication components of fig. 4 and 5 are configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices. The device in which the communication component is located can access a wireless network based on a communication standard, such as WiFi, a mobile communication network such as 2G, 3G, 4G/LTE or 5G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power supply components of fig. 4 and 5, among other things, provide power to the various components of the device in which the power supply components are located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (28)

1. A field annotation generation method, comprising:
acquiring a field name to be processed;
performing pinyin word segmentation on the field name to obtain a pinyin sequence;
determining a Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full spelling and/or pinyin shorthand and Chinese;
and generating a field annotation corresponding to the field name according to the Chinese sequence.
2. The method of claim 1, wherein the pinyin sequence includes at least one character group, and wherein the character group includes a pinyin full spelling or a pinyin shorthand.
3. The method of claim 2, wherein performing pinyin word segmentation on the field name to obtain the pinyin sequence comprises:
inputting the field name into a pinyin word segmentation model;
in the pinyin word segmentation model, identifying at least one string of consecutive characters in the field name that conforms to the pinyin full-spelling characteristics and at least one single character that does not conform to the pinyin full-spelling characteristics, and using each of them as a character group to generate the pinyin sequence corresponding to the field name.
4. The method of claim 3, wherein identifying the at least one string of consecutive characters in the field name that conforms to the pinyin full-spelling characteristics and the at least one single character that does not conform to the pinyin full-spelling characteristics comprises:
for each character in the field name, determining, according to the character and its context, the probability that the character conforms to the state characteristics of each spelling position within a pinyin full spelling;
if the probability that the character conforms to the state characteristics of any spelling position within a pinyin full spelling satisfies a preset condition, determining that the character conforms to the pinyin full-spelling characteristics;
and if, for every spelling position within a pinyin full spelling, the probability that the character conforms to the state characteristics of that position fails to satisfy the preset condition, determining that the character is a single character that does not conform to the pinyin full-spelling characteristics.
5. The method of claim 4, further comprising, prior to entering the field names into the pinyin word segmentation model:
acquiring a training text, and converting the training text into a pinyin full spelling to acquire a training sequence;
marking the spelling position of the character in the training sequence;
and inputting the marked training sequence into the pinyin word segmentation model so that the pinyin word segmentation model can learn the state characteristics of all spelling positions under the pinyin full spelling as the pinyin full spelling characteristics.
6. The method of claim 3, wherein the pinyin word segmentation model is a hidden Markov model (HMM).
7. The method of claim 1, wherein determining the Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full spelling and/or pinyin shorthand and Chinese comprises:
inputting the pinyin sequence into a pinyin understanding model;
in the pinyin understanding model, determining Chinese corresponding to at least one character group in the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin short writing and the Chinese; and forming the Chinese sequence according to the Chinese corresponding to the at least one character group.
8. The method of claim 7, wherein the inputting the pinyin sequence into a pinyin understanding model comprises:
determining a target industry field where the field name is located;
inputting the pinyin sequence and the target industry field into the pinyin understanding model;
the determining of the Chinese corresponding to at least one character group in the pinyin sequence based on the mapping relationship between pinyin full pinyin and/or pinyin shorthand and Chinese comprises:
and determining the Chinese corresponding to at least one character group in the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin short writing and Chinese in the target industry field.
9. The method of claim 7, further comprising, prior to entering the pinyin sequence into a pinyin understanding model:
acquiring a training text, and dividing the training text into single character sequences;
converting the single character sequence into a pinyin full-spelling sequence and a pinyin abbreviated sequence;
and training the pinyin understanding model by taking the pinyin full-pinyin sequence and the pinyin abbreviated sequence as input and the training text as output so that the pinyin understanding model learns the mapping relation between the pinyin full-pinyin, the pinyin short-hand and Chinese.
10. The method of claim 7, wherein the pinyin understanding model is a sequence-to-sequence (seq2seq) model.
11. The method of claim 1, further comprising:
supplementing the field annotation corresponding to the field name to a database where the field name is located; or
constructing an association relation between the field names and the field annotations in the database based on the field annotation corresponding to the field name and the field annotations corresponding to other field names in the database where the field name is located.
12. The method of claim 1, further comprising, prior to performing pinyin word segmentation on the field names to obtain pinyin sequences:
if the field name contains a separator character, dividing the field name into a plurality of character segments according to the separator character;
for each of the plurality of character segments, respectively performing the operation of pinyin word segmentation to obtain a pinyin sequence and the operation of determining the Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full spelling and/or pinyin shorthand and Chinese, to obtain the respective Chinese sequences of the plurality of character segments;
wherein generating the field annotation corresponding to the field name according to the Chinese sequence comprises:
splicing the Chinese sequences of the plurality of character segments to generate the field annotation corresponding to the field name.
13. A computing device, comprising a memory and a processor;
the memory is to store one or more computer instructions;
the processor is coupled with the memory for executing the one or more computer instructions for:
acquiring a field name to be processed;
performing pinyin word segmentation on the field names to obtain pinyin sequences;
determining a Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin short writing and Chinese;
and generating a field annotation corresponding to the field name according to the Chinese sequence.
14. The apparatus of claim 13, wherein the pinyin sequence includes at least one character group, and wherein the character group includes a pinyin full spelling or a pinyin shorthand.
15. The apparatus of claim 14, wherein the processor, in performing pinyin word segmentation on the field names to obtain pinyin sequences, is configured to:
inputting the field names into a pinyin word segmentation model;
in the pinyin word segmentation model, at least one string of continuous characters which accord with pinyin full-spelling characteristics and at least one single character which does not accord with the pinyin full-spelling characteristics in the field names are identified and respectively used as character groups to generate pinyin sequences corresponding to the field names.
16. The apparatus of claim 15, wherein the processor, in identifying the at least one string of consecutive characters in the field name that conforms to the pinyin full-spelling characteristics and the at least one single character that does not conform to the pinyin full-spelling characteristics, is configured to:
for each character in the field name, determining the probability that the character accords with the state characteristics of each spelling position under the full spelling of the pinyin according to the context of the character and the character;
if the probability that the character meets the state characteristics of any spelling position under the pinyin full-spelling meets the preset condition, determining that the character meets the pinyin full-spelling characteristics;
and if the probability that the character meets the state characteristics of all spelling positions under the pinyin full-spelling does not meet the preset condition, determining that the character is a single character which does not meet the pinyin full-spelling characteristics.
17. The apparatus of claim 16, wherein the processor, prior to entering the field names into the pinyin word segmentation model, is further configured to:
acquiring a training text, and converting the training text into a pinyin full spelling to acquire a training sequence;
marking the spelling position of the character in the training sequence;
and inputting the marked training sequence into the pinyin word segmentation model so that the pinyin word segmentation model can learn the state characteristics of all spelling positions under the pinyin full spelling as the pinyin full spelling characteristics.
18. The apparatus of claim 15, wherein the pinyin word segmentation model is a hidden Markov model (HMM).
19. The apparatus of claim 13, wherein the processor, when determining the chinese sequence corresponding to the pinyin sequence based on a mapping between pinyin full-pinyin and/or pinyin shorthand and chinese, is configured to:
inputting the pinyin sequence into a pinyin understanding model;
in the pinyin understanding model, determining Chinese corresponding to at least one character group in the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin short writing and the Chinese; and forming the Chinese sequence according to the Chinese corresponding to the at least one character group.
20. The apparatus of claim 19, wherein the processor, when entering the pinyin sequence into a pinyin understanding model, is configured to:
determining a target industry field where the field name is located;
inputting the pinyin sequence and the target industry field into the pinyin understanding model;
when determining the Chinese corresponding to at least one character group in the pinyin sequence based on the mapping relation between pinyin full spelling and/or pinyin shorthand and Chinese, the processor is configured to:
and determining the Chinese corresponding to at least one character group in the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin short writing and Chinese in the target industry field.
21. The device of claim 19, wherein prior to entering the pinyin sequence into a pinyin understanding model, the processor is further configured to:
acquiring a training text, and dividing the training text into single character sequences;
converting the single character sequence into a pinyin full-spelling sequence and a pinyin abbreviated sequence;
and training the pinyin understanding model by taking the pinyin full-pinyin sequence and the pinyin abbreviated sequence as input and the training text as output so that the pinyin understanding model learns the mapping relation between the pinyin full-pinyin, the pinyin short-hand and Chinese.
22. The apparatus of claim 19, wherein the pinyin understanding model employs a sequence-to-sequence seq2seq model.
23. The device of claim 13, wherein the processor is further configured to:
supplementing the field annotation corresponding to the field name to a database where the field name is located; or
constructing an association relation between the field names and the field annotations in the database based on the field annotation corresponding to the field name and the field annotations corresponding to other field names in the database where the field name is located.
24. The apparatus of claim 13, wherein the processor, prior to performing pinyin word segmentation on the field names to obtain pinyin sequences, is further configured to:
if the field name contains a separator character, dividing the field name into a plurality of character segments according to the separator character;
for each of the plurality of character segments, respectively performing the operation of pinyin word segmentation to obtain a pinyin sequence and the operation of determining the Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full spelling and/or pinyin shorthand and Chinese, to obtain the respective Chinese sequences of the plurality of character segments;
wherein, when generating the field annotation corresponding to the field name according to the Chinese sequence, the processor is configured to:
splicing the Chinese sequences of the plurality of character segments to generate the field annotation corresponding to the field name.
25. A character string understanding method, comprising:
acquiring a character string to be understood;
performing pinyin word segmentation on the character string to be understood to obtain a pinyin sequence;
determining a Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin short writing and Chinese;
and generating an understanding result corresponding to the character string to be understood according to the Chinese sequence.
26. The method of claim 25, wherein the string to be understood comprises: one or more of a field name in the database, a character string in the chat content, a professional term, and a search keyword.
27. A computing device comprising a memory and a processor;
the memory is to store one or more computer instructions;
the processor is coupled with the memory for executing the one or more computer instructions for:
acquiring a character string to be understood;
performing pinyin word segmentation on the character string to be understood to obtain a pinyin sequence;
determining a Chinese sequence corresponding to the pinyin sequence based on the mapping relation between pinyin full pinyin and/or pinyin short writing and Chinese;
and generating an understanding result corresponding to the character string to be understood according to the Chinese sequence.
28. A computer-readable storage medium storing computer instructions, which when executed by one or more processors, cause the one or more processors to perform the field annotation generation method of any one of claims 1-12 or the string understanding method of any one of claims 25-26.
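
As a rough illustration only, the flow recited in claims 1, 3, 7 and 12 (split on separator characters, segment each piece into character groups, map the groups to Chinese, splice the results) might be sketched as follows. This is not the claimed implementation: a greedy longest-match lookup stands in for the HMM-based pinyin word segmentation model, and a small dictionary stands in for the seq2seq pinyin understanding model. The names `SYLLABLES`, `LEXICON`, `segment`, `to_chinese` and `annotate`, and all sample data, are hypothetical.

```python
import re

# Hypothetical toy data for illustration; not taken from the application.
SYLLABLES = {"yong", "hu", "xing", "ming", "dian", "hua", "di", "zhi"}
LEXICON = {
    ("yong", "hu"): "用户", ("y", "h"): "用户",
    ("xing", "ming"): "姓名", ("x", "m"): "姓名",
    ("dian", "hua"): "电话", ("d", "h"): "电话",
    ("di", "zhi"): "地址", ("d", "z"): "地址",
}
MAX_SYLLABLE_LEN = 6  # longest standard pinyin syllables have six letters, e.g. "zhuang"
MAX_SPAN = max(len(key) for key in LEXICON)


def segment(text):
    """Split text into character groups: known full syllables or single characters.

    Greedy longest match is an assumed stand-in for the segmentation model.
    """
    groups, i = [], 0
    while i < len(text):
        for size in range(min(MAX_SYLLABLE_LEN, len(text) - i), 0, -1):
            if text[i:i + size] in SYLLABLES:
                groups.append(text[i:i + size])
                i += size
                break
        else:
            # No full syllable starts here: treat the character as shorthand.
            groups.append(text[i])
            i += 1
    return groups


def to_chinese(groups):
    """Map runs of character groups to Chinese; unknown groups pass through."""
    out, i = [], 0
    while i < len(groups):
        for span in range(min(MAX_SPAN, len(groups) - i), 0, -1):
            key = tuple(groups[i:i + span])
            if key in LEXICON:
                out.append(LEXICON[key])
                i += span
                break
        else:
            out.append(groups[i])
            i += 1
    return "".join(out)


def annotate(field_name):
    """Split on separators, process each segment, and splice the results."""
    segments = [s for s in re.split(r"[_\-.\s]+", field_name.lower()) if s]
    return "".join(to_chinese(segment(s)) for s in segments)
```

With this toy data, `segment("yonghuxm")` yields `["yong", "hu", "x", "m"]` and `annotate("yonghu_xm")` returns `"用户姓名"`. A trained model would resolve ambiguous splits and homophones from context, which a fixed lookup table cannot do.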

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010425675.6A CN113688613A (en) 2020-05-19 2020-05-19 Method, device and storage medium for generating field annotation and understanding character string

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010425675.6A CN113688613A (en) 2020-05-19 2020-05-19 Method, device and storage medium for generating field annotation and understanding character string

Publications (1)

Publication Number Publication Date
CN113688613A true CN113688613A (en) 2021-11-23

Family

ID=78576129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010425675.6A Pending CN113688613A (en) 2020-05-19 2020-05-19 Method, device and storage medium for generating field annotation and understanding character string

Country Status (1)

Country Link
CN (1) CN113688613A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050209844A1 (en) * 2004-03-16 2005-09-22 Google Inc., A Delaware Corporation Systems and methods for translating chinese pinyin to chinese characters
US20160306783A1 (en) * 2014-05-07 2016-10-20 Tencent Technology (Shenzhen) Company Limited Method and apparatus for phonetically annotating text
CN106933972A (en) * 2017-02-14 2017-07-07 杭州数梦工场科技有限公司 The method and device of data element are defined using natural language processing technique
CN108681536A (en) * 2018-04-27 2018-10-19 青岛大学 A kind of carrier-free steganography method based on Chinese phonetic alphabet multiple mapping
CN108629046A (en) * 2018-05-14 2018-10-09 平安科技(深圳)有限公司 A kind of fields match method and terminal device
CN111142681A (en) * 2018-11-06 2020-05-12 北京嘀嘀无限科技发展有限公司 Method, system, device and storage medium for determining pinyin of Chinese characters
CN109902090A (en) * 2019-02-19 2019-06-18 北京明略软件系统有限公司 Field name acquisition methods and device
CN110162794A (en) * 2019-05-29 2019-08-23 腾讯科技(深圳)有限公司 A kind of method and server of participle
CN110413972A (en) * 2019-07-23 2019-11-05 杭州城市大数据运营有限公司 A kind of table name field name intelligence complementing method based on NLP technology
CN110569505A (en) * 2019-09-04 2019-12-13 平顶山学院 text input method and device
CN111144096A (en) * 2019-12-11 2020-05-12 心医国际数字医疗系统(大连)有限公司 HMM-based pinyin completion training method, completion model, completion method and completion input method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
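NEVILLE RYANT et al.: "Automating phonetic measurement: The case of voice onset time", J ACOUST SOC AM, 31 December 2013 (2013-12-31) *
ZHANG Sen, ZONG Chengqing, CHEN Zhaoxiong, HUANG Heyan: "Analysis of intelligent processing mechanisms for sentence-level pinyin-to-Chinese-character conversion" (语句拼音-汉字转换的智能处理机制分析), Journal of Chinese Information Processing (中文信息学报), no. 02, 30 June 1998 (1998-06-30) *
HUANG Changning: "The word segmentation problem in Chinese information processing" (中文信息处理中的分词问题), Applied Linguistics (语言文字应用), no. 01, 15 February 1997 (1997-02-15) *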

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination