WO2023238975A1

WO2023238975A1 - Apparatus and method for converting grapheme to phoneme

Info

Publication number: WO2023238975A1
Application number: PCT/KR2022/008366
Authority: WO
Inventors: 김정준; 한창진; 채경수
Original assignee: 주식회사 딥브레인에이아이
Priority date: 2022-06-10
Filing date: 2022-06-14
Publication date: 2023-12-14
Also published as: KR20230170346A; KR102622609B1

Abstract

Disclosed are an apparatus and method for converting a grapheme to a phoneme. An apparatus for converting a grapheme to a phoneme according to one embodiment comprises: a tokenization unit for dividing an input string into tokens; and a phoneme determination unit for determining the phoneme of each token on the basis of the token and tokens directly adjacent to the left and right thereof.

Description

Grapheme-phoneme conversion device and method

The present disclosure relates to technology for converting graphemes into phonemes.

Grapheme-to-phoneme (G2P) conversion is a preprocessing step applied to text before it is input to a speech synthesis model, and it improves speech synthesis quality by removing ambiguity about pronunciation. For example, the English word "grapheme" can be pronounced without confusion once it is expressed as the phoneme sequence [ˈɡræfiːm].

In the case of Chinese, graphemes correspond to Chinese characters and phonemes correspond to pinyin, and converting Chinese characters into correct pinyin is an important task in Chinese speech synthesis.

Meanwhile, Chinese grapheme-to-phoneme conversion has traditionally been based on dictionaries and rules. However, experts must spend a great deal of time creating dictionaries and rules that cover the various exception cases, and maintenance becomes more difficult as the number of dictionaries and rules grows.

An object of the present disclosure is to provide a device and method for converting graphemes into phonemes.

A grapheme-phoneme conversion device according to one aspect may include: a tokenization unit that divides an input string into tokens; and a phoneme determination unit that determines the phoneme of each token based on each token and its left and right adjacent tokens.

The input string is Chinese, and the token may be a grapheme.

The phoneme determination unit may determine whether each token is a monophone or a polyphone, determine the phoneme of a monophone token using a matching table or a grapheme-phoneme dictionary, and determine the phoneme of a polyphone token based on that token and its left and right adjacent tokens.

The phoneme determination unit may determine the phoneme of each token using a pre-trained phoneme determination model based on a multi-layer perceptron.

The phoneme determination unit may include: an embedding unit that generates a token embedding of each token and generates an input string embedding corresponding to the input string based on the generated token embeddings; a moving unit that generates a left-shifted embedding by shifting the input string embedding to the left by a predetermined interval and generates a right-shifted embedding by shifting the input string embedding to the right by the predetermined interval; a combination unit that stacks the input string embedding, the left-shifted embedding, and the right-shifted embedding and, based on the stacking result, generates a combination token embedding of each token by combining the token embedding of that token with the token embeddings of its left and right adjacent tokens; and a determination unit that determines the phoneme of each token based on the generated combination token embedding.

The predetermined interval may be one token.

The moving unit may generate the left-shifted embedding by adding a zero padding token to the right of the input string embedding and removing the leftmost token embedding, and may generate the right-shifted embedding by adding a zero padding token to the left of the input string embedding and removing the rightmost token embedding.

The phoneme determination unit may include: an embedding unit that generates a token embedding of each token and generates an input string embedding corresponding to the input string based on the generated token embeddings; a zero padding addition unit that adds zero padding tokens to the left and right sides of the generated input string embedding; a combination unit that, based on the input string embedding to which the zero padding tokens are added, generates a combination token embedding of each token by combining the token embedding of that token with the token embeddings of its left and right adjacent tokens; and a determination unit that determines the phoneme of each token based on the generated combination token embedding.

A grapheme-phoneme conversion method according to another aspect may include: splitting an input string into tokens; and determining the phoneme of each token based on each token and its left and right adjacent tokens.

The input string is Chinese, and the token may be a grapheme.

Determining the phoneme of each token may include determining whether each token is a monophone or a polyphone, determining the phoneme of a monophone token using a matching table or a grapheme-phoneme dictionary, and determining the phoneme of a polyphone token based on that token and its left and right adjacent tokens.

In determining the phoneme of each token, the phoneme of each token may be determined using a pre-trained phoneme determination model based on a multi-layer perceptron.

Determining the phoneme of each token may include: generating a token embedding of each token; generating an input string embedding corresponding to the input string based on the generated token embeddings; generating a left-shifted embedding by shifting the input string embedding to the left by a predetermined interval; generating a right-shifted embedding by shifting the input string embedding to the right by the predetermined interval; stacking the input string embedding, the left-shifted embedding, and the right-shifted embedding and, based on the stacking result, generating a combination token embedding of each token by combining the token embedding of that token with the token embeddings of its left and right adjacent tokens; and determining the phoneme of each token based on the generated combination token embedding.

The predetermined interval may be one token.

Generating the left-shifted embedding may include adding a zero padding token to the right of the input string embedding and removing the leftmost token embedding, and generating the right-shifted embedding may include adding a zero padding token to the left of the input string embedding and removing the rightmost token embedding.

Determining the phoneme of each token may include: generating a token embedding of each token; generating an input string embedding corresponding to the input string based on the generated token embeddings; adding zero padding tokens to the left and right sides of the generated input string embedding; generating a combination token embedding of each token by combining, based on the input string embedding to which the zero padding tokens are added, the token embedding of that token with the token embeddings of its left and right adjacent tokens; and determining the phoneme of each token based on the generated combination token embedding.

When the phoneme of a grapheme constituting an input string is determined, only the local context (i.e., the left and right adjacent graphemes of each grapheme) is considered rather than the context of the entire string, so accurate grapheme-to-phoneme conversion is possible with a relatively lightweight model and a small amount of computation.

FIG. 1 is a block diagram showing a grapheme-phoneme conversion device according to an exemplary embodiment.

FIG. 2 is an exemplary diagram for explaining the principle of grapheme-phoneme conversion according to an exemplary embodiment.

FIG. 3 is a diagram illustrating an embodiment of the phoneme determination unit 120 of FIG. 1.

FIGS. 4 and 5 are exemplary diagrams for explaining a phoneme determination process according to an embodiment.

FIG. 6 is a diagram illustrating another embodiment of the phoneme determination unit 120 of FIG. 1.

FIG. 7 is an exemplary diagram for explaining a phoneme determination process according to another embodiment.

FIG. 8 is a flowchart showing a grapheme-phoneme conversion method according to an exemplary embodiment.

FIG. 9 is a flowchart illustrating an embodiment of a process 820 for determining the phoneme of each token in FIG. 8.

FIG. 10 is a flowchart illustrating another embodiment of the process 820 of determining the phoneme of each token in FIG. 8.

FIG. 11 is a block diagram illustrating a computing environment including a computing device suitable for use in exemplary embodiments.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the attached drawings. When adding reference numerals to components in each drawing, it should be noted that identical components are given the same reference numerals as much as possible even if they are shown in different drawings. Additionally, in describing the present invention, if it is determined that a detailed description of a related known function or configuration may unnecessarily obscure the gist of the present invention, the detailed description will be omitted.

Meanwhile, in each step, unless a specific order is clearly stated in the context, each step may occur in a different order from the specified order. That is, each step may be performed in the same order as specified, may be performed substantially simultaneously, or may be performed in the opposite order.

The terms described below are terms defined in consideration of functions in the present invention, and may vary depending on the intention or custom of the user or operator. Therefore, the definition should be made based on the contents throughout this specification.

Terms such as first, second, etc. may be used to describe various components, but the components should not be limited by the terms. The terms are used only to distinguish one component from another. Singular expressions include plural expressions unless the context clearly indicates otherwise. Terms such as 'include' or 'have' are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood as not precluding the possibility of the existence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In addition, the division of components in this specification is merely a division according to the main function each component is responsible for. That is, two or more components may be combined into one component, or one component may be divided into two or more components for more detailed functions. In addition to its own main functions, each component may additionally perform some or all of the functions that other components are responsible for, and some of the main functions of each component may instead be performed entirely by other components. Each component may be implemented as hardware or software, or as a combination of hardware and software.

FIG. 1 is a block diagram illustrating a grapheme-to-phoneme conversion device according to an exemplary embodiment, and FIG. 2 is an exemplary diagram illustrating the principle of grapheme-to-phoneme conversion according to an exemplary embodiment.

Referring to FIGS. 1 and 2 , the grapheme-phoneme conversion device 100 according to an exemplary embodiment may include a tokenization unit 110 and a phoneme determination unit 120.

The tokenization unit 110 may divide the input string 10 into tokens 11 to 15. At this time, the input string 10 may be Chinese, Korean, English, Japanese, etc., but this is only an example and is not limited thereto. Tokens are logically distinguishable classification elements and may be phrases, words, morphemes, syllables, graphemes, etc. If the input string is Chinese, the token may be a grapheme.
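As a minimal sketch of the tokenization step for Chinese input, the snippet below treats each non-whitespace character as one grapheme token. The function name `tokenize` and the handling of whitespace are assumptions made for illustration and are not details taken from the embodiment.

```python
def tokenize(text: str) -> list[str]:
    """Split a string into grapheme tokens, one per character.

    Assumption: whitespace is dropped and every other character is a token.
    """
    return [ch for ch in text if not ch.isspace()]


tokens = tokenize("天地 的")
print(tokens)
```

A real tokenization unit would also need a policy for punctuation, digits, and non-Chinese text, which this sketch leaves out.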

Hereinafter, for convenience of explanation, an example will be given where the input string is Chinese, that is, when the grapheme-phoneme conversion device is a Chinese grapheme-phoneme conversion device.

The phoneme determination unit 120 may determine the phoneme of each token 11 to 15 based on that token and its left and right adjacent tokens. For example, the phoneme determination unit 120 may determine the phoneme of each token 11 to 15 using a pre-trained phoneme determination model. At this time, the phoneme determination model may be a multi-layer perceptron, but is not limited thereto.

When determining the phoneme of a token (e.g., 13) constituting the input string 10, the phoneme determination unit 120 according to an exemplary embodiment considers not only the token itself (e.g., 13) but also its left and right adjacent tokens (e.g., 12 and 14). This makes effective Chinese grapheme-to-phoneme conversion possible with a relatively lightweight model.

Chinese characters can be divided into monophones, which have only one pronunciation, and polyphones, which have multiple pronunciations. Because a monophone has only one pronunciation, it can be converted directly to its known phoneme, whereas the pronunciation of a polyphone is determined by context.

According to one embodiment, the phoneme determination unit 120 may determine whether each token is a monophone or a polyphone, determine the phoneme of a monophone token using a matching table or a grapheme-phoneme dictionary, and determine the phoneme based on the token and its left and right adjacent tokens only for a polyphone token. At this time, the matching table or grapheme-phoneme dictionary may be stored in internal or external memory.
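The monophone/polyphone split described above can be sketched as a simple dispatch. The table contents, the numbered-pinyin notation, and the names `MONOPHONE_TABLE`, `convert`, and `toy_judge` are all hypothetical. The sketch also assumes that any token missing from the table is treated as a polyphone, and `toy_judge` is only a stand-in for the trained context model.

```python
from typing import Callable, Optional

# Hypothetical matching table: each monophone grapheme has exactly one reading.
MONOPHONE_TABLE = {"開": "kai1", "天": "tian1"}

# A context judge receives (token, left neighbor, right neighbor).
ContextJudge = Callable[[str, Optional[str], Optional[str]], str]


def convert(tokens: list[str], judge: ContextJudge) -> list[str]:
    """Look up monophones directly; send only polyphones to the judge."""
    phonemes = []
    for i, token in enumerate(tokens):
        if token in MONOPHONE_TABLE:
            phonemes.append(MONOPHONE_TABLE[token])
            continue
        # Assumption: anything not in the table is handled as a polyphone.
        left = tokens[i - 1] if i > 0 else None
        right = tokens[i + 1] if i + 1 < len(tokens) else None
        phonemes.append(judge(token, left, right))
    return phonemes


def toy_judge(token: str, left: Optional[str], right: Optional[str]) -> str:
    """Stand-in for the trained model: returns a fixed reading per token."""
    return {"地": "di4", "的": "de5"}.get(token, "<unk>")


result = convert(["開", "天", "地", "的"], toy_judge)
print(result)
```

Keeping the table lookup outside the model means the model only has to run for tokens whose pronunciation is actually ambiguous.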

Hereinafter, embodiments of the phoneme determination unit 120 will be described in detail with reference to FIGS. 3 to 7.

FIG. 3 is a diagram illustrating an embodiment of the phoneme determination unit 120 of FIG. 1, and FIGS. 4 and 5 are exemplary diagrams for explaining a phoneme determination process according to an embodiment.

Referring to FIGS. 3 to 5, the phoneme determination unit 120a may include an embedding unit 310, a moving unit 320, a combination unit 330, and a determination unit 340.

The embedding unit 310 may generate embeddings 41 to 45 (hereinafter referred to as token embeddings) of each token 11 to 15 included in the input string 10. Here, an embedding may also be called a feature vector or an embedding vector, and may have a predetermined dimension. FIGS. 4 and 5 show three-dimensional token embeddings as an example, but this is only for convenience of explanation, and the number of dimensions is not particularly limited.

According to an exemplary embodiment, the embedding unit 310 may generate the token embeddings 41 to 45 from the respective tokens 11 to 15 using an embedding model. At this time, the embedding model is a machine learning model and may be trained in advance to generate embeddings from tokens.

The embedding unit 310 may generate an embedding 46 (hereinafter referred to as an input string embedding) corresponding to the input string 10 based on the generated token embeddings 41 to 45. For example, the embedding unit 310 may generate the input string embedding 46 by arranging the token embeddings 41 to 45 according to the positions of each token 11 to 15 in the input string 10.
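A minimal sketch of building an input string embedding follows. It assumes, purely for illustration, that the trained embedding model can be approximated by a lookup table. The vectors in `EMBEDDING_TABLE` are illustrative values, and the zero-vector fallback for unknown tokens is an assumption.

```python
# Hypothetical 3-dimensional embedding table standing in for a trained model.
EMBEDDING_TABLE = {
    "開": [0.1, 0.05, 0.03],
    "天": [0.2, 0.1, 0.4],
    "地": [0.3, 0.6, 0.2],
    "的": [0.15, 0.09, 0.8],
}


def embed_string(tokens: list[str]) -> list[list[float]]:
    """Arrange token embeddings in the order the tokens appear in the string."""
    dim = len(next(iter(EMBEDDING_TABLE.values())))
    # Assumption: tokens missing from the table map to a zero vector.
    return [list(EMBEDDING_TABLE.get(tok, [0.0] * dim)) for tok in tokens]


input_string_embedding = embed_string(["開", "天", "地", "的"])
print(len(input_string_embedding), len(input_string_embedding[0]))
```

The result is a sequence of vectors with one row per token, which is the shape the later shifting and combining steps operate on.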

The moving unit 320 may generate a right-shifted embedding 51 by shifting the input string embedding 46 to the right by a predetermined interval, and may generate a left-shifted embedding 52 by shifting the input string embedding 46 to the left by the predetermined interval. At this time, the predetermined interval may be one token, but is not limited thereto.

For example, as shown in FIG. 5, the moving unit 320 may generate the right-shifted embedding 51 by adding a zero padding token ([0, 0, 0]) to the left of the input string embedding 46 and removing the rightmost token embedding ([0.15, 0.09, 0.8]). Additionally, the moving unit 320 may generate the left-shifted embedding 52 by adding a zero padding token ([0, 0, 0]) to the right of the input string embedding 46 and removing the leftmost token embedding ([0.1, 0.05, 0.03]). At this time, the zero padding token may have the same dimension as the token embeddings 41 to 45.

The combination unit 330 may stack the input string embedding 46, the right-shifted embedding 51, and the left-shifted embedding 52 and, based on the stacking result, generate combination token embeddings 53 to 57 by combining the token embedding of each token 11 to 15 with the token embeddings of the left and right adjacent tokens of that token.

For example, as shown in FIG. 5, the combination unit 330 may sequentially stack the left-shifted embedding 52, the input string embedding 46, and the right-shifted embedding 51 from the bottom. The combination unit 330 may then extract the token embeddings at the same position from each of the left-shifted embedding 52, the input string embedding 46, and the right-shifted embedding 51 and combine them, thereby generating the combination token embeddings 53 to 57 of the tokens 11 to 15. For example, the combination unit 330 may generate the combination token embeddings 53 to 57 by concatenating the extracted token embeddings, that is, the token embedding of each token 11 to 15 and the token embeddings of its left and right adjacent tokens, but is not limited thereto.

Meanwhile, FIG. 5 shows an embodiment in which the left-shifted embedding 52, the input string embedding 46, and the right-shifted embedding 51 are sequentially stacked from the bottom, but this is only one embodiment and the disclosure is not limited thereto. That is, there is no particular limitation on the order in which the left-shifted embedding 52, the input string embedding 46, and the right-shifted embedding 51 are stacked.

The shifting of the input string embedding and the generation of the combination token embeddings can be expressed as Equation 1.

In Equation 1, one symbol represents the input string embedding, 0 represents a zero padding token in the form of a zero padding vector, k represents the shift size, and w represents the window size. Two further symbols represent the left-shifted embedding and the right-shifted embedding, respectively.

For example, if the shift size k is 1 and the window size w is 3, the element at position t of the stacked result is a combination of token embeddings and may contain the three token embeddings at positions t-1, t, and t+1. The leftmost token of the input string embedding may be stacked with the token to its right and a zero padding token, and the rightmost token may be stacked with the token to its left and a zero padding token.
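The shift-and-stack procedure for a shift size of k = 1 and a window size of w = 3 can be sketched in plain Python as follows. The function names are hypothetical, the middle embedding is an illustrative value, and concatenating in left, center, right order is just one of the stacking orders the description allows.

```python
Sequence = list[list[float]]


def shift_right(seq: Sequence, k: int = 1) -> Sequence:
    """Add k zero padding tokens on the left and drop the k rightmost embeddings."""
    pad = [[0.0] * len(seq[0]) for _ in range(k)]
    return (pad + seq)[: len(seq)]


def shift_left(seq: Sequence, k: int = 1) -> Sequence:
    """Add k zero padding tokens on the right and drop the k leftmost embeddings."""
    pad = [[0.0] * len(seq[0]) for _ in range(k)]
    return (seq + pad)[k:]


def combine(seq: Sequence, k: int = 1) -> Sequence:
    """Concatenate, per position, the left neighbor, the token, and the right neighbor."""
    right_shifted = shift_right(seq, k)  # position t holds the embedding of token t-k
    left_shifted = shift_left(seq, k)    # position t holds the embedding of token t+k
    return [r + c + l for r, c, l in zip(right_shifted, seq, left_shifted)]


embedding = [[0.1, 0.05, 0.03], [0.2, 0.3, 0.4], [0.15, 0.09, 0.8]]
combined = combine(embedding)
print(combined[0])
```

Each combined row has three times the token embedding dimension, and the first and last rows contain a zero padding token in place of the missing neighbor.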

The determination unit 340 may determine the phonemes 111 to 115 of each token 11 to 15 based on the combined token embeddings 53 to 57 of each token 11 to 15.

According to an exemplary embodiment, the determination unit 340 may determine the phoneme of each token 11 to 15 from the combination token embeddings 53 to 57 of the tokens 11 to 15 using a phoneme determination model. At this time, the phoneme determination model may be a machine learning model trained in advance to determine the phoneme of each token from the combination token embedding of that token. For example, the phoneme determination model may be a multi-layer perceptron, but is not limited thereto.

According to an exemplary embodiment, the phoneme determination model may include three fully connected layers. Each of the two layers other than the last may include a network unit (e.g., a feed-forward network (FFN) or a convolutional neural network (CNN)), a normalization unit (e.g., batch normalization or layer normalization), an activation function (e.g., sigmoid, GELU, ReLU, tanh, or ELU), and a regularization unit (e.g., dropout). The last of the three fully connected layers may include a network unit and a softmax function. Expressed mathematically, the phoneme determination model can be written as Equation 2.

Here, the input represents the combination token embedding of each token, L represents a network unit such as an FFN or CNN, N represents a normalization unit such as batch normalization or layer normalization, A represents an activation function such as sigmoid, GELU, ReLU, tanh, or ELU, R represents a regularization unit such as dropout, dict represents the size of the grapheme-phoneme dictionary, and Softmax represents the softmax function. The output represents the probability of each class (phoneme candidate).
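The three-layer structure described above can be sketched as an inference-time forward pass in plain Python. The sketch makes several choices that the description leaves open: it uses a feed-forward network unit, layer normalization, and ReLU, and it treats dropout as the identity because it is only active during training. The layer sizes, the random untrained weights, and all function names are hypothetical.

```python
import math
import random


def linear(x, weights, bias):
    """Network unit L: one fully connected (feed-forward) transform."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def layer_norm(x, eps=1e-5):
    """Normalization unit N: layer normalization without learned scale or shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def relu(x):
    """Activation function A."""
    return [max(0.0, v) for v in x]


def softmax(x):
    """Turn scores into class probabilities (numerically stabilized)."""
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random, untrained weights used only to make the sketch runnable."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def judge(x, layers):
    """Two hidden blocks (L, N, A; dropout R omitted at inference), then L and softmax."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    h = relu(layer_norm(linear(x, w1, b1)))
    h = relu(layer_norm(linear(h, w2, b2)))
    return softmax(linear(h, w3, b3))


rng = random.Random(0)
num_classes = 4  # stands in for the size of the grapheme-phoneme dictionary
layers = [init_layer(rng, 9, 16), init_layer(rng, 16, 16), init_layer(rng, 16, num_classes)]
probs = judge([0.0, 0.0, 0.0, 0.1, 0.05, 0.03, 0.2, 0.3, 0.4], layers)
predicted_class = max(range(num_classes), key=lambda i: probs[i])
```

With trained weights, `predicted_class` would index the most probable phoneme candidate for the token whose combination token embedding was passed in.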

According to one embodiment, the determination unit 340 may determine whether each token is a monophone or a polyphone. Additionally, the determination unit 340 may determine the phoneme of a monophone token using a matching table or a grapheme-phoneme dictionary, and may determine the phoneme using the combination token embedding only for a polyphone token.

For example, in FIG. 5, the monophone tokens may be '開' (11) and '天' (12), and the polyphone tokens may be '??' (13), '地' (14), and '的' (15). The determination unit 340 may determine the phonemes 111 and 112 of the monophone tokens '開' (11) and '天' (12) using a matching table or a grapheme-phoneme dictionary, and may determine the phonemes 113, 114, and 115 of the polyphone tokens '??' (13), '地' (14), and '的' (15) using the phoneme determination model.

Meanwhile, according to one embodiment, the embedding unit 310 may generate the token embeddings 41 to 45 of the tokens 11 to 15 included in the input string 10 and then reduce or increase the dimension of the token embeddings 41 to 45. In this case, the generation and shifting of the input string embedding, the generation of the combination token embeddings, and the phoneme determination of each token may be performed for each dimension, and the results for each dimension may be integrated to finally determine the phoneme of each token.

FIG. 6 is a diagram illustrating another embodiment of the phoneme determination unit 120 of FIG. 1, and FIG. 7 is an exemplary diagram for explaining a phoneme determination process according to another embodiment.

Referring to FIGS. 6 and 7 , the phoneme determination unit 120b may include an embedding unit 610, a zero padding addition unit 620, a combination unit 630, and a determination unit 640. Here, since the embedding unit 610 and the determination unit 640 are substantially the same as or similar to the embedding unit 310 and the determination unit 340 of FIG. 3, their detailed descriptions will be omitted.

The zero padding addition unit 620 may add zero padding tokens to the left and right sides of the input string embedding 46 generated by the embedding unit 610, thereby generating an input string embedding 71 to which the zero padding tokens are added.

For example, as shown in FIG. 7, the zero padding addition unit 620 may add a zero padding token ([0, 0, 0]) to the left of the input string embedding 46 and a zero padding token ([0, 0, 0]) to the right, thereby generating the input string embedding 71 with zero padding tokens added on both sides.

The combination unit 630 may generate combination token embeddings 53 to 57 of each token 11 to 15 based on the input string embedding 71 to which a zero padding token is added.

For example, as shown in FIG. 7, the combination unit 630 may extract, from the input string embedding 71 to which the zero padding tokens are added, the token embedding of each token 11 to 15 and the token embeddings of its left and right adjacent tokens, and concatenate them to generate the combination token embeddings 53 to 57 of the tokens 11 to 15.
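The zero-padding variant can be sketched as a sliding window of three positions over the padded sequence. The function names are hypothetical, the middle embedding is an illustrative value, and the left, center, right concatenation order is an assumption.

```python
Sequence = list[list[float]]


def pad_both_sides(seq: Sequence) -> Sequence:
    """Add one zero padding token to the left and one to the right."""
    dim = len(seq[0])
    return [[0.0] * dim] + seq + [[0.0] * dim]


def combine_by_window(seq: Sequence) -> Sequence:
    """For each token, concatenate its left neighbor, itself, and its right neighbor."""
    padded = pad_both_sides(seq)
    combined = []
    for t in range(len(seq)):
        # Token t sits at index t + 1 of the padded sequence.
        left, center, right = padded[t], padded[t + 1], padded[t + 2]
        combined.append(left + center + right)
    return combined


embedding = [[0.1, 0.05, 0.03], [0.2, 0.3, 0.4], [0.15, 0.09, 0.8]]
combined = combine_by_window(embedding)
print(combined[-1])
```

For a one-token neighborhood this yields the same combination token embeddings as shifting the sequence left and right and stacking, while keeping only one padded copy of the sequence.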

When determining the phoneme of each token constituting the input string, the grapheme-phoneme conversion device 100 according to an exemplary embodiment considers only the local context, that is, the left and right adjacent tokens of each token, rather than the context of the entire string. The phoneme of each token can therefore be determined accurately with a relatively lightweight model and a small amount of computation.

Meanwhile, the grapheme-phoneme conversion device 100 according to an exemplary embodiment may be applied to a text-to-speech conversion system. At this time, the text-to-speech conversion system may be a device that receives arbitrary text data and converts the contents of the input text data into speech data.

The grapheme-phoneme conversion method according to the exemplary embodiment of FIG. 8 may be performed by the grapheme-phoneme conversion apparatus 100 of FIG. 1 .

Referring to FIG. 8, the grapheme-phoneme conversion device may split the input string into tokens (810). At this time, a token may be a phrase, word, morpheme, syllable, grapheme, etc., and if the input string is Chinese, the token may be a grapheme.

The grapheme-phoneme conversion device can determine the phoneme of each token based on each token and its left and right adjacent tokens (820).

For example, the grapheme-phoneme conversion device may determine the phoneme of each token using a pre-trained phoneme determination model. At this time, the phoneme determination model may be a multi-layer perceptron, but is not limited thereto.

According to one embodiment, the grapheme-phoneme conversion device may determine whether each token is a monophone or a polyphone, convert a monophone token into a phoneme using a matching table or grapheme-phoneme dictionary in which monophone tokens are matched with phonemes, and determine the phoneme based on the token and its left and right adjacent tokens only for a polyphone token.

Referring to FIG. 9, the grapheme-phoneme conversion device may generate a token embedding of each token (910). For example, the grapheme-phoneme conversion device may generate the token embedding of each token using an embedding model. At this time, the embedding model is a machine learning model and may be trained in advance to generate embeddings from tokens.

The grapheme-to-phoneme conversion device may generate an input string embedding based on the token embedding of each generated token (920). For example, a grapheme-to-phoneme conversion device may generate an input string embedding by listing the token embedding of each token according to the position of each token in the input string.

The grapheme-phoneme conversion device may generate a right-shifted embedding by shifting the input string embedding to the right by a predetermined interval, and may generate a left-shifted embedding by shifting the input string embedding to the left by the predetermined interval (930). At this time, the predetermined interval may be one token, but is not limited thereto.

For example, a grapheme-to-phoneme converter can create a right-shifted embedding by adding zero padding tokens to the left of the input string embedding and removing the rightmost token embedding. Additionally, the grapheme-to-phoneme conversion device can create a left-shifted embedding by adding a zero padding token to the right of the input string embedding and removing the leftmost token embedding.

The grapheme-to-phoneme conversion device stacks the input string embedding, right shift embedding, and left shift embedding, and based on the results, for each token, the token embedding of each token and the token embeddings of the left and right adjacent tokens of each token are combined. A combined token embedding can be generated (940).

For example, the grapheme-to-phoneme converter stacks the left shift embedding, input string embedding, and right shift embedding sequentially from the bottom, and extracts the token embedding at the same position from each of the left shift embedding, input string embedding, and right shift embedding, By combining, we can create a combined token embedding of each token. For example, the grapheme-phoneme conversion device may generate a combination token embedding by concatenating the extracted token embeddings, that is, the token embedding of each token and the token embeddings of the left and right adjacent tokens of each token.

The grapheme-phoneme conversion device may determine the phoneme of each token based on the combination token embedding of each token (950).

According to an exemplary embodiment, the grapheme-phoneme conversion device may determine the phoneme of each token from the combination token embedding of that token using a phoneme determination model. At this time, the phoneme determination model may be a machine learning model trained in advance to determine the phoneme of each token from the combination token embedding of that token. For example, the phoneme determination model may be a multi-layer perceptron, but is not limited thereto.

According to an exemplary embodiment, the grapheme-phoneme conversion device may determine whether each token is a monophone or a polyphone, determine the phoneme of a monophone token using a matching table or a grapheme-phoneme dictionary, and determine the phoneme using the combination token embedding only for a polyphone token.

Referring to FIG. 10, the grapheme-phoneme conversion device may generate a token embedding of each token (1010) and generate an input string embedding based on the token embedding of each generated token (1020).

The grapheme-phoneme conversion device may generate an input string embedding with zero padding tokens added by adding zero padding tokens to the left and right sides of the generated input string embedding (1030).

The grapheme-to-phoneme conversion device may generate a combined token embedding of each token based on the input string embedding with zero padding tokens added to the left and right sides (1040).

For example, the grapheme-phoneme conversion device may extract, from the input string embedding with zero padding tokens added on the left and right, the token embedding of each token and the token embeddings of its left and right adjacent tokens, and concatenate them to generate the combination token embedding of each token.

FIG. 11 is a block diagram illustrating a computing environment including a computing device suitable for use in exemplary embodiments. In the illustrated embodiment, each component may have different functions and capabilities in addition to those described below, and additional components other than those described below may be included.

The illustrated computing environment 1100 includes a computing device 1110 . In one embodiment, computing device 1110 may be a grapheme-to-phoneme conversion device 100.

Computing device 1110 may include at least one processor 1111, a computer-readable storage medium 1112, and a communication bus 1113. Processor 1111 may cause computing device 1110 to operate according to the above-mentioned exemplary embodiments. For example, the processor 1111 may execute one or more programs stored in the computer-readable storage medium 1112. The one or more programs may include one or more computer-executable instructions, which, when executed by the processor 1111, may be configured to cause computing device 1110 to perform operations according to the exemplary embodiments.

Computer-readable storage medium 1112 may be configured to store computer-executable instructions or program code, program data, and/or other suitable forms of information. The program 1114 stored in the computer-readable storage medium 1112 may include a set of instructions executable by the processor 1111. In one embodiment, computer-readable storage medium 1112 may be memory (volatile memory such as random access memory, non-volatile memory, or an appropriate combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, any other form of storage medium that can be accessed by computing device 1110 and can store desired information, or a suitable combination thereof.

Communication bus 1113 may interconnect various other components of computing device 1110.

Computing device 1110 may also include one or more input/output interfaces 1115, which provide an interface for one or more input/output devices 1120, and one or more network communication interfaces 1116. The input/output interface 1115 and the network communication interface 1116 may be connected to the communication bus 1113. Input/output device 1120 may be connected to other components of computing device 1110 through input/output interface 1115. Exemplary input/output devices 1120 may include input devices such as a pointing device (such as a mouse or trackpad), a keyboard, a touch input device (such as a touchpad or touch screen), a voice or sound input device, various types of sensor devices, and/or an imaging device, and/or output devices such as a display device, a printer, a speaker, and/or a network card. The exemplary input/output device 1120 may be included within the computing device 1110 as a component of the computing device 1110, or may be connected to the computing device 1110 as a separate device distinct from the computing device 1110.

The present invention has been described above with a focus on its preferred embodiments. A person skilled in the art to which the present invention pertains will understand that the present invention may be implemented in a modified form without departing from its essential characteristics. Accordingly, the scope of the present invention is not limited to the above-described embodiments, but should be construed to include various embodiments within a scope equivalent to the content described in the patent claims.

Claims

1. A grapheme-to-phoneme conversion device comprising:

a tokenization unit that divides an input string into tokens; and

a phoneme determination unit that determines the phoneme of each token based on the token and its left and right adjacent tokens.
2. The grapheme-to-phoneme conversion device of claim 1, wherein the input string is in Chinese and each token is a character.
3. The grapheme-to-phoneme conversion device of claim 1, wherein the phoneme determination unit determines whether each token is monophonic or polyphonic, determines the phoneme of a monophonic token using a matching table or a grapheme-phoneme dictionary, and determines the phoneme of a polyphonic token based on the token and its left and right adjacent tokens.
4. The grapheme-to-phoneme conversion device of claim 1, wherein the phoneme determination unit determines the phoneme of each token using a pre-trained multilayer perceptron-based phoneme determination model.
5. The grapheme-to-phoneme conversion device of claim 1, wherein the phoneme determination unit comprises:

an embedding unit that generates a token embedding of each token and generates an input string embedding corresponding to the input string based on the generated token embeddings;

a shifting unit that generates a left-shifted embedding by shifting the input string embedding to the left by a predetermined interval and generates a right-shifted embedding by shifting the input string embedding to the right by the predetermined interval;

a combination unit that stacks the input string embedding, the left-shifted embedding, and the right-shifted embedding and, based on the stacking result, generates a combined token embedding of each token by combining the token embedding of the token with the token embeddings of its left and right adjacent tokens; and

a determination unit that determines the phoneme of each token based on the generated combined token embedding.
6. The grapheme-to-phoneme conversion device of claim 5, wherein the predetermined interval is one token.
7. The grapheme-to-phoneme conversion device of claim 5, wherein the shifting unit generates the left-shifted embedding by adding a zero padding token to the right of the input string embedding and removing the leftmost token embedding, and generates the right-shifted embedding by adding a zero padding token to the left of the input string embedding and removing the rightmost token embedding.
8. The grapheme-to-phoneme conversion device of claim 1, wherein the phoneme determination unit comprises:

an embedding unit that generates a token embedding of each token and generates an input string embedding corresponding to the input string based on the generated token embeddings;

a zero padding addition unit that adds zero padding tokens to the left and right sides of the generated input string embedding;

a combination unit that, based on the input string embedding to which the zero padding tokens have been added, generates a combined token embedding of each token by combining the token embedding of the token with the token embeddings of its left and right adjacent tokens; and

a determination unit that determines the phoneme of each token based on the generated combined token embedding.
9. A grapheme-to-phoneme conversion method comprising:

dividing an input string into tokens; and

determining the phoneme of each token based on the token and its left and right adjacent tokens.
10. The grapheme-to-phoneme conversion method of claim 9, wherein the input string is in Chinese and each token is a character.
11. The grapheme-to-phoneme conversion method of claim 9, wherein determining the phoneme of each token comprises determining whether each token is monophonic or polyphonic, determining the phoneme of a monophonic token using a matching table or a grapheme-phoneme dictionary, and determining the phoneme of a polyphonic token based on the token and its left and right adjacent tokens.
12. The grapheme-to-phoneme conversion method of claim 9, wherein determining the phoneme of each token comprises determining the phoneme of each token using a pre-trained multilayer perceptron-based phoneme determination model.
13. The grapheme-to-phoneme conversion method of claim 9, wherein determining the phoneme of each token comprises:

generating a token embedding of each token;

generating an input string embedding corresponding to the input string based on the generated token embeddings;

generating a left-shifted embedding by shifting the input string embedding to the left by a predetermined interval;

generating a right-shifted embedding by shifting the input string embedding to the right by the predetermined interval;

stacking the input string embedding, the left-shifted embedding, and the right-shifted embedding and, based on the stacking result, generating a combined token embedding of each token by combining the token embedding of the token with the token embeddings of its left and right adjacent tokens; and

determining the phoneme of each token based on the generated combined token embedding.
14. The grapheme-to-phoneme conversion method of claim 13, wherein the predetermined interval is one token.
15. The grapheme-to-phoneme conversion method of claim 13, wherein:

generating the left-shifted embedding comprises adding a zero padding token to the right of the input string embedding and removing the leftmost token embedding; and

generating the right-shifted embedding comprises adding a zero padding token to the left of the input string embedding and removing the rightmost token embedding.
16. The grapheme-to-phoneme conversion method of claim 9, wherein determining the phoneme of each token comprises:

generating a token embedding of each token;

generating an input string embedding corresponding to the input string based on the generated token embeddings;

adding zero padding tokens to the left and right sides of the generated input string embedding;

generating, based on the input string embedding to which the zero padding tokens have been added, a combined token embedding of each token by combining the token embedding of the token with the token embeddings of its left and right adjacent tokens; and

determining the phoneme of each token based on the generated combined token embedding.
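
As a non-limiting illustration only, and not forming part of the claims, the shift-and-stack operation recited above may be sketched in Python as follows, assuming a shift interval of one token. The function names `shift_left`, `shift_right`, and `combine_by_shifting`, the plain-list representation of embeddings, and the left/center/right concatenation order are assumptions made for this sketch.

```python
from typing import List

Vector = List[float]


def shift_left(seq: List[Vector], zero: Vector) -> List[Vector]:
    # Add a zero padding token on the right and drop the leftmost embedding,
    # so position i now holds the embedding of token i + 1.
    return seq[1:] + [zero]


def shift_right(seq: List[Vector], zero: Vector) -> List[Vector]:
    # Add a zero padding token on the left and drop the rightmost embedding,
    # so position i now holds the embedding of token i - 1.
    return [zero] + seq[:-1]


def combine_by_shifting(string_embedding: List[Vector]) -> List[Vector]:
    """Stack the original and shifted embeddings and concatenate per position."""
    if not string_embedding:
        return []
    zero = [0.0] * len(string_embedding[0])  # assumed all-zero padding vector
    left_shifted = shift_left(string_embedding, zero)
    right_shifted = shift_right(string_embedding, zero)

    # "Stacking" is modelled here by zipping the three aligned sequences.
    # At position i: right_shifted holds the left neighbour, left_shifted
    # holds the right neighbour. Assumed order: left, center, right.
    return [
        prev + cur + nxt
        for prev, cur, nxt in zip(right_shifted, string_embedding, left_shifted)
    ]


# Three tokens with 2-dimensional embeddings (toy values).
embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
stacked = combine_by_shifting(embeddings)
single = combine_by_shifting([[7.0]])
```

Under these assumptions, the shifted sequences keep the same length as the input string embedding, so the three sequences align position by position and no explicit per-token neighbour lookup is needed.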