CN112509578A - Voice information recognition method and device, electronic equipment and storage medium

Info

Publication number
CN112509578A
CN112509578A
Authority
CN
China
Prior art keywords
text
spoken
written
text data
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011455674.2A
Other languages
Chinese (zh)
Inventor
李婉瑜 (Li Wanyu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202011455674.2A
Publication of CN112509578A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/063 Training (under G10L 15/06 Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/26 Speech to text systems

Abstract

The embodiments of the present disclosure disclose a voice information recognition method and device, an electronic device, and a storage medium. The method comprises the following steps: acquiring at least one text corpus pair, wherein the text corpus pair comprises a mapping relation between a written text and a spoken text; determining the association relation between the written text and the spoken text; acquiring written text data, and acquiring spoken text data corresponding to the written text data according to the written text data and the association relation; and using the spoken text data for voice information recognition. According to the technical scheme of the embodiments, the association relation between the written text and the spoken text is determined from the acquired text corpus pairs, the corresponding spoken text data is obtained from the written text data and the association relation and used for voice information recognition, text information suited to spoken, daily-life scenarios is obtained, and the capability of recognizing spoken voice information is improved.

Description

Voice information recognition method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to voice recognition technologies, and in particular, to a method and an apparatus for recognizing voice information, an electronic device, and a storage medium.
Background
With continued advances in science and technology, voice recognition has developed rapidly and is widely applied in many fields such as intelligent control, bringing great convenience to people's daily lives.
Existing voice recognition techniques usually obtain written text data from document materials such as books, form training data from the correspondence between voice information and the text data, train a voice recognition model, and then perform voice recognition through that model.
However, such a voice recognition model is poorly suited to the colloquial language used in daily life: compared with written text, colloquial language is expressed more loosely and does not fully conform to standard grammatical rules, so existing voice recognition methods perform poorly in daily-life application scenarios.
Disclosure of Invention
The present disclosure provides a method, an apparatus, a device and a storage medium for recognizing speech information, so as to realize recognition of spoken speech information in daily life.
In a first aspect, an embodiment of the present disclosure provides a method for recognizing voice information, including:
acquiring at least one text corpus pair, wherein the text corpus pair comprises a mapping relation between a written text and a spoken text;
determining the association relation between the written text and the spoken text;
acquiring written text data, and acquiring spoken text data corresponding to the written text data according to the written text data and the association relation;
and using the spoken text data for voice information recognition.
In a second aspect, an embodiment of the present disclosure provides an apparatus for recognizing voice information, including:
a text corpus pair acquisition module, configured to acquire at least one text corpus pair, wherein the text corpus pair comprises a mapping relation between a written text and a spoken text;
an association relation acquisition module, configured to determine the association relation between the written text and the spoken text;
a spoken text data acquisition module, configured to acquire written text data, and to acquire spoken text data corresponding to the written text data according to the written text data and the association relation;
and a speech recognition execution module, configured to use the spoken text data for voice information recognition.
In a third aspect, an embodiment of the present disclosure provides an electronic device, comprising a memory, a processing device, and a computer program stored in the memory and executable on the processing device, wherein the processing device, when executing the computer program, implements the voice information recognition method according to any embodiment of the present disclosure.
In a fourth aspect, embodiments of the present disclosure provide a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the voice information recognition method of any embodiment of the present disclosure.
According to the technical scheme of this embodiment, the association relation between the written text and the spoken text is determined from the acquired text corpus pairs, and the corresponding spoken text data is obtained from the written text data and the association relation and used for voice information recognition; text information suited to spoken, daily-life scenarios is thereby obtained, and the capability of recognizing spoken voice information is improved.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
FIG. 1 is a flow chart of one embodiment of a method of speech information recognition of the present disclosure;
FIG. 2 is a block diagram of a voice information recognition apparatus according to an embodiment of the present disclosure;
FIG. 3 is a block diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It should be noted that the modifiers "a", "an", and "the" in this disclosure are intended to be illustrative rather than restrictive; those skilled in the art should understand them as meaning "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Embodiment 1
Fig. 1 is a flowchart of a voice information recognition method provided by Embodiment 1 of the present disclosure. This embodiment is applicable to recognizing voice information in daily-life scenarios. The method may be executed by the voice information recognition apparatus provided by the embodiments of the present disclosure, which may be implemented in software and/or hardware and integrated in an electronic device. The method specifically comprises the following steps:
s110, obtaining at least one text corpus pair, wherein the text corpus pair comprises a mapping relation between a written text and a spoken text.
Spoken text is the language format used in daily life: it is plain and easy to understand, loosely organized, simple and sometimes disorderly in structure, and conforms to everyday speaking habits. Written text is the language format used in documents and on formal occasions: it is strongly logical and normative, grammatically strict and complete, and conforms to reading habits. A text corpus pair comprising one written text and one spoken text reflects the mapping between the two; for example, the written text "I think this book is very good" may correspond to the spoken text "I feel like this book's pretty good".
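To make the mapping concrete, a text corpus pair can be thought of as a (written, spoken) sentence pair. A minimal sketch in Python follows; the example sentences are illustrative placeholders, not data from this disclosure:

```python
# Each text corpus pair maps a written (formal) text to a spoken
# (colloquial) text. The sentences below are hypothetical placeholders.
corpus_pairs = [
    # (written text, spoken text)
    ("I think this book is very good.", "I feel like this book's pretty good."),
    ("We must do this work well.", "This work, we must, must do it well."),
    ("How was the meal?", "How was the meal, huh?"),
]
```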
Optionally, in the embodiments of the present disclosure, the spoken text includes repetition text, inversion text, modal-particle text, interrogative-conversion text, retroflex text, and/or incoherent text. Repetition text repeats certain words in a sentence, usually for emphasis and also as a matter of personal speaking habit; for example, the spoken narration "This work, we must, must do it well" contains a repetition, and the corresponding written text is "We must do this work well". Inversion is a grammatical device that reverses the order of sentence elements such as subject, predicate, object, and attributive, usually to emphasize or highlight a word, and may likewise reflect personal speaking habit; for example, the inverted spoken text "Many books were scattered on the floor, all in a jumble" corresponds to the written text "Many books were scattered in a jumble on the floor". A modal particle is a function word that expresses tone, usually appearing at the end of a sentence or at a pause within it; for example, the spoken text "How was the meal, huh?" corresponds to the written text "How was the meal?". An interrogative sentence is a sentence-mood category intended to ask about something, so its content is not stated affirmatively; for example, the written text "You cannot miss this meeting" may correspond to the spoken interrogative "Can you really not go to this meeting?". Retroflexion (erhua) is a sound change in which the final of certain characters is pronounced with a curled tongue; for example, the written "画" (huà, "painting") corresponds to the spoken retroflexed "画儿" (huàr). Incoherent text reflects the broken-off or repeated words that occur in spontaneous speech; for example, the spoken expression "We, we go to- together" corresponds to the written text "We go together".
S120, determining the association relation between the written text and the spoken text.
Because a text corpus pair includes the mapping relation between a written text and a spoken text, after one or more text corpus pairs are acquired, all the mapping relations in the text corpus pairs may be summarized, for example, in the form of an association comparison table: each written text in the text corpus pairs has a uniquely matched spoken text in the table, and the association comparison table as a whole represents the association relation between written text and spoken text.
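A hedged sketch of such an association comparison table, assuming the corpus_pairs list from the previous sketch; the dictionary here is only one possible realization of the table:

```python
from typing import Optional

# Summarize all mapping relations in the text corpus pairs: each written
# text is keyed to its uniquely matched spoken text.
association_table = {written: spoken for written, spoken in corpus_pairs}

def lookup_spoken(written_text: str) -> Optional[str]:
    """Return the spoken text associated with a written text, if any."""
    return association_table.get(written_text)
```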
Optionally, in the embodiments of the present disclosure, determining the association relation between the written text and the spoken text includes: acquiring a text translation model according to the at least one text corpus pair, wherein the text translation model represents the association relation between the written text and the spoken text. The acquired text corpus pairs are used as training samples to perform translation training on an initial text translation model, such as a neural network model; the trained text translation model is capable of translating written text into spoken text.
Optionally, in the embodiments of the present disclosure, acquiring a text translation model according to the at least one text corpus pair includes: constructing a neural network model based on the Transformer architecture, and performing text translation training on the Transformer-based neural network model through the at least one text corpus pair to obtain a trained text translation model. The Transformer architecture is an attention-based encoder-decoder structure, comprising six stacked encoder layers and six stacked decoder layers, with results produced by an output layer connected to the last decoder layer. Conventional neural network models, usually Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), mainly capture correlations between words that are close together. In practice, however, distance alone cannot serve as the criterion of relevance between words: words that are far apart may still be strongly related, and this is especially true of spoken text, where inversion and inserted words separate elements that belong together (in the inverted spoken sentence above, for instance, the displaced modifier at the end of the sentence still relates to the subject at its beginning). In the Transformer-based neural network model adopted in the embodiments of the present disclosure, once the written text is input, the correlation between any two characters is modeled in the same way regardless of their distance, so the association between characters is expressed accurately and the independence of each character is preserved.
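As a sketch only, such a model could be assembled from PyTorch's built-in Transformer (six encoder and six decoder layers, as described above). The vocabulary size, model dimension, and character-level tokenization are assumptions, and positional encoding is omitted for brevity:

```python
import torch
import torch.nn as nn

class WrittenToSpokenTransformer(nn.Module):
    """Sketch of a character-level written-to-spoken translation model
    built on the Transformer encoder-decoder architecture."""

    def __init__(self, vocab_size: int, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # shared src/tgt embedding
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6,  # 6 + 6 stacked layers
            batch_first=True,
        )
        # Output layer connected to the last decoder layer.
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids: torch.Tensor, tgt_ids: torch.Tensor) -> torch.Tensor:
        # Causal mask: each target position attends only to earlier positions.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        h = self.transformer(self.embed(src_ids), self.embed(tgt_ids), tgt_mask=tgt_mask)
        return self.out(h)  # logits over the character vocabulary
```

Training would then minimize cross-entropy between the predicted characters and the spoken side of each text corpus pair.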
S130, acquiring written text data, and acquiring spoken text data corresponding to the written text data according to the written text data and the association relation.
Written text data is a data set formed of multiple written texts, i.e. text information expressed in written form and obtained from sources such as electronic newspapers, electronic books, and web pages. The spoken text data corresponding to the written text data is acquired according to the association relation obtained above (for example, the association comparison table described in the foregoing technical scheme).
Optionally, in the embodiments of the present disclosure, acquiring the spoken text data corresponding to the written text data according to the written text data and the association relation includes: inputting the written text data into the text translation model to obtain the corresponding spoken text data. In this technical scheme, the corresponding spoken text data is obtained by translating the written text data through the text translation model.
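A sketch of the translation step, assuming the model above plus a character-level tokenizer with BOS/EOS ids (these helpers are assumptions, not part of the disclosure); greedy decoding is used for simplicity:

```python
import torch

@torch.no_grad()
def translate_written(model, src_ids, bos_id: int, eos_id: int, max_len: int = 64):
    """Greedily decode one written sentence (char-id tensor of shape
    [1, src_len]) into spoken-style character ids."""
    model.eval()
    tgt = torch.tensor([[bos_id]], dtype=torch.long)
    for _ in range(max_len):
        logits = model(src_ids, tgt)             # [1, tgt_len, vocab]
        next_id = logits[:, -1].argmax(dim=-1)   # most probable next character
        tgt = torch.cat([tgt, next_id.unsqueeze(1)], dim=1)
        if next_id.item() == eos_id:
            break
    return tgt[0, 1:]  # drop BOS; the spoken text's character ids
```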
Optionally, in the embodiments of the present disclosure, after the corresponding spoken text data is obtained, the method further includes: screening the spoken text data according to a preset screening rule to obtain standard spoken text data; using the spoken text data for voice information recognition then comprises: using the standard spoken text data for voice information recognition. Because noise or disturbance during training may leave the acquired text translation model with a translation bias, biased items must be removed from the translation results according to the preset screening rule. For example, although a spoken text and its corresponding written text differ in manner of expression, their substance is the same, so most of the characters in a translated spoken text should also appear in its source written text; if a translated spoken text shares almost no characters with its source, i.e. the character similarity is below a preset similarity threshold, the translation result is considered inaccurate.
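The character-similarity screen described above might be realized as a simple overlap ratio against a preset threshold; a sketch, where the threshold value 0.5 is an arbitrary assumption:

```python
from collections import Counter

def char_similarity(spoken: str, written: str) -> float:
    """Fraction of characters the translated spoken text shares with its
    source written text (multiset intersection over the longer length)."""
    common = Counter(spoken) & Counter(written)
    return sum(common.values()) / max(len(spoken), len(written), 1)

def passes_similarity(spoken: str, written: str, threshold: float = 0.5) -> bool:
    # A translation sharing almost no characters with its source is
    # treated as inaccurate and removed.
    return char_similarity(spoken, written) >= threshold
```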
Optionally, the preset screening rule includes sentence length screening, unregistered word screening, and/or sentence frequency screening. Because written text data consists of normalized text that conforms to reading habits, in the embodiments of the present disclosure the written text data comprises complete, well-formed sentences rather than single words or disordered sequences of words; spoken text data whose sentences are too long or too short can therefore be treated as wrong translation results and deleted. Given the diversity of written text data, if many different written texts yield the same spoken text, this likewise indicates a bias in the acquired text translation model, so spoken text data whose sentence frequency (i.e. frequency of occurrence) is above a preset frequency threshold is treated as a wrong translation result and deleted. An unregistered word (Out Of Vocabulary, OOV) is a word absent from the word-segmentation vocabulary, including proper nouns (e.g. names of people, places, and businesses), abbreviations, and newly coined words; an unregistered word should be segmented as an independent word, but may mistakenly be split into different words, producing wrong segmentation during sentence translation, so spoken text data containing unregistered words is also deleted and is not used as a training sample for the speech recognition model.
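The three preset screening rules could be combined into one filter. In this sketch the length bounds, the frequency threshold, and segmentation by whitespace (real Chinese text would need a word segmenter) are all illustrative assumptions:

```python
from collections import Counter

def screen_spoken_corpus(spoken_sentences, vocabulary,
                         min_len=4, max_len=50, max_freq=100):
    """Keep only spoken sentences that pass all three preset rules."""
    freq = Counter(spoken_sentences)  # sentence occurrence frequency
    kept = []
    for sent in spoken_sentences:
        # Sentence length screening: too long or too short -> wrong translation.
        if not (min_len <= len(sent) <= max_len):
            continue
        # Sentence frequency screening: the same output from many different
        # inputs indicates a biased translation model.
        if freq[sent] > max_freq:
            continue
        # Unregistered-word screening: out-of-vocabulary tokens suggest
        # wrong word segmentation during translation.
        if any(token not in vocabulary for token in sent.split()):
            continue
        kept.append(sent)
    return kept
```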
S140, using the spoken text data for voice information recognition.
After the spoken text data is obtained, spoken speech data matching the spoken text data can be obtained through speech synthesis, and the spoken text data and the spoken speech data then form speech training corpus pairs for voice information recognition.
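A sketch of assembling the speech training corpus pairs; `synthesize` below is a hypothetical placeholder for whatever speech-synthesis engine is used, not an API named in this disclosure:

```python
def build_speech_corpus(spoken_sentences, synthesize):
    """Pair each spoken sentence with synthesized audio to form the
    (spoken speech data, spoken text data) training corpus pairs.

    `synthesize` is a hypothetical callable mapping text -> waveform.
    """
    return [(synthesize(sent), sent) for sent in spoken_sentences]
```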
Optionally, in the embodiments of the present disclosure, using the spoken text data for voice information recognition includes: acquiring a speech recognition model according to the spoken text data, and recognizing acquired voice information through the speech recognition model. A speech recognition model obtained by training an initial speech recognition model (for example, a neural-network-based one) with the speech training corpus pairs composed of spoken text data and spoken speech data has recognition capability for spoken-language scenarios.
Optionally, in the embodiments of the present disclosure, acquiring a speech recognition model according to the spoken text data includes: acquiring a spoken speech recognition model according to the spoken text data; and/or acquiring a universal speech recognition model according to the spoken text data and the written text data. Besides training a spoken speech recognition model for spoken-language scenarios from the spoken text data, written speech data matching the written text data can also be obtained through speech synthesis; taking both the corpus pairs of spoken text data and spoken speech data and the corpus pairs of written text data and written speech data as training samples yields a universal speech recognition model with recognition capability for both spoken-language and written-language scenarios.
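For the universal speech recognition model, the two kinds of training samples could simply be pooled; a sketch reusing the hypothetical `synthesize` helper from above:

```python
def build_general_corpus(spoken_sentences, written_sentences, synthesize):
    """Pool spoken-domain and written-domain corpus pairs so the trained
    model can recognize both spoken-style and written-style speech."""
    spoken_pairs = [(synthesize(s), s) for s in spoken_sentences]
    written_pairs = [(synthesize(w), w) for w in written_sentences]
    return spoken_pairs + written_pairs
```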
According to the technical scheme of this embodiment, the association relation between the written text and the spoken text is determined from the acquired text corpus pairs, and the corresponding spoken text data is obtained from the written text data and the association relation and used for voice information recognition; text information suited to spoken, daily-life scenarios is thereby obtained, and the capability of recognizing spoken voice information is improved.
Embodiment 2
Fig. 2 is a structural block diagram of a voice information recognition apparatus provided by Embodiment 2 of the present disclosure, which specifically includes: a text corpus pair acquisition module 210, an association relation acquisition module 220, a spoken text data acquisition module 230, and a speech recognition execution module 240.
a text corpus pair acquisition module 210, configured to acquire at least one text corpus pair, wherein the text corpus pair comprises a mapping relation between a written text and a spoken text;
an association relation acquisition module 220, configured to determine the association relation between the written text and the spoken text;
a spoken text data acquisition module 230, configured to acquire written text data and to acquire spoken text data corresponding to the written text data according to the written text data and the association relation;
and a speech recognition execution module 240, configured to use the spoken text data for voice information recognition.
According to the technical scheme of this embodiment, the association relation between the written text and the spoken text is determined from the acquired text corpus pairs, and the corresponding spoken text data is obtained from the written text data and the association relation and used for voice information recognition; text information suited to spoken, daily-life scenarios is thereby obtained, and the capability of recognizing spoken voice information is improved.
Optionally, on the basis of the foregoing technical solution, the association relation acquisition module 220 is specifically configured to acquire a text translation model according to the at least one text corpus pair, wherein the text translation model represents the association relation between the written text and the spoken text.
Optionally, on the basis of the foregoing technical solution, the spoken text data acquisition module 230 is specifically configured to input the written text data into the text translation model to obtain the corresponding spoken text data.
Optionally, on the basis of the foregoing technical solution, the speech recognition execution module 240 is specifically configured to obtain a speech recognition model according to the spoken text data, and recognize the obtained speech information through the speech recognition model.
Optionally, on the basis of the foregoing technical solution, the spoken text includes repetition text, inversion text, modal-particle text, interrogative-conversion text, retroflex text, and/or incoherent text.
Optionally, on the basis of the foregoing technical solution, the association relation acquisition module 220 is specifically configured to construct a neural network model based on the Transformer architecture, and to perform text translation training on the Transformer-based neural network model through the at least one text corpus pair to obtain a trained text translation model.
Optionally, on the basis of the above technical solution, the apparatus for recognizing voice information further includes:
a standard spoken text data acquisition module, configured to screen the spoken text data according to a preset screening rule to obtain standard spoken text data.
Optionally, on the basis of the foregoing technical solution, the speech recognition execution module 240 is specifically configured to use the standard spoken text data for voice information recognition.
Optionally, on the basis of the above technical scheme, the preset screening rule includes a sentence length screening rule, an unregistered word screening rule, and/or a sentence frequency screening rule.
Optionally, on the basis of the foregoing technical solution, the speech recognition execution module 240 is specifically configured to acquire a spoken speech recognition model according to the spoken text data; and/or to acquire a universal speech recognition model according to the spoken text data and the written text data.
The apparatus can execute the voice information recognition method provided by any embodiment of the present disclosure and has the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in this embodiment, reference may be made to the method provided by any embodiment of the present disclosure.
Embodiment 3
FIG. 3 illustrates a schematic diagram of an electronic device 300 suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 3 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 3, the electronic device 300 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 301 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage means 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data necessary for the operation of the electronic device 300 are also stored. The processing means 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
Generally, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 308 including, for example, magnetic tape, hard disk, etc.; and a communication device 309. The communication means 309 may allow the electronic device 300 to communicate wirelessly or by wire with other devices to exchange data. While fig. 3 illustrates an electronic device 300 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 309, or installed from the storage means 308, or installed from the ROM 302. The computer program, when executed by the processing device 301, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire at least one text corpus pair, wherein the text corpus pair comprises a mapping relation between a written text and a spoken text; determine the association relation between the written text and the spoken text; acquire written text data, and acquire spoken text data corresponding to the written text data according to the written text data and the association relation; and use the spoken text data for voice information recognition.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or by hardware. The name of a module does not in some cases limit the module itself; for example, the standard spoken text data acquisition module may also be described as "a module for screening spoken text data according to a preset screening rule to obtain standard spoken text data". The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, [ example 1 ] there is provided a recognition method of voice information, including:
acquiring at least one text corpus pair, wherein the text corpus pair comprises a mapping relation between a written text and a spoken text;
determining the association relation between the written text and the spoken text;
acquiring written text data, and acquiring spoken text data corresponding to the written text data according to the written text data and the association relation;
and using the spoken text data for voice information recognition.
In accordance with one or more embodiments of the present disclosure, [ example 2 ] there is provided the method of example 1, further comprising:
acquiring a text translation model according to the at least one text corpus pair, wherein the text translation model represents the association relation between the written text and the spoken text;
and inputting the written text data into the text translation model to obtain corresponding spoken text data.
In accordance with one or more embodiments of the present disclosure, [ example 3 ] there is provided the method of example 1, further comprising:
acquiring a speech recognition model according to the spoken text data, and recognizing the acquired voice information through the speech recognition model.
In accordance with one or more embodiments of the present disclosure, [ example 4 ] there is provided the method of example 1, further comprising:
the spoken text includes read-back text, flip-chip text, inflection text, question mode conversion text, retroactive text, and/or incoherent text.
In accordance with one or more embodiments of the present disclosure, [ example 5 ] there is provided the method of example 2, further comprising:
constructing a neural network model based on the Transformer architecture, and performing text translation training on the Transformer-based neural network model through the at least one text corpus pair to obtain a trained text translation model.
In accordance with one or more embodiments of the present disclosure, [ example 6 ] there is provided the method of example 2, further comprising:
screening the spoken text data according to a preset screening rule to obtain standard spoken text data;
and using the standard spoken text data for voice information recognition.
In accordance with one or more embodiments of the present disclosure, [ example 7 ] there is provided the method of example 6, further comprising:
the preset screening rules comprise sentence length screening, unregistered word screening and/or sentence frequency screening.
According to one or more embodiments of the present disclosure, [ example 8 ] there is provided the method of example 3, further comprising:
acquiring a spoken speech recognition model according to the spoken text data;
and/or acquiring a universal speech recognition model according to the spoken text data and the written text data.
According to one or more embodiments of the present disclosure, [ example 9 ] there is provided an apparatus for recognizing voice information, including:
a text corpus pair acquisition module, configured to acquire at least one text corpus pair, wherein the text corpus pair comprises a mapping relation between a written text and a spoken text;
an association relation acquisition module, configured to determine the association relation between the written text and the spoken text;
a spoken text data acquisition module, configured to acquire written text data and to acquire spoken text data corresponding to the written text data according to the written text data and the association relation;
and a speech recognition execution module, configured to use the spoken text data for voice information recognition.
According to one or more embodiments of the present disclosure, [ example 10 ] there is provided the apparatus of example 9, further comprising:
the association relation acquisition module being specifically configured to acquire a text translation model according to the at least one text corpus pair, wherein the text translation model represents the association relation between the written text and the spoken text.
The spoken text data acquisition module is specifically configured to input the written text data into the text translation model to obtain the corresponding spoken text data.
According to one or more embodiments of the present disclosure, [ example 11 ] there is provided the apparatus of example 9, further comprising:
the speech recognition execution module being specifically configured to acquire a speech recognition model according to the spoken text data and to recognize the acquired voice information through the speech recognition model.
According to one or more embodiments of the present disclosure, [ example 12 ] there is provided the apparatus of example 9, further comprising:
the spoken text includes read-back text, flip-chip text, inflection text, question mode conversion text, retroactive text, and/or incoherent text.
According to one or more embodiments of the present disclosure, [ example 13 ] there is provided the apparatus of example 10, further comprising:
the association relation acquisition module being specifically configured to construct a neural network model based on the Transformer architecture, and to perform text translation training on the Transformer-based neural network model through the at least one text corpus pair to obtain a trained text translation model.
According to one or more embodiments of the present disclosure, [ example 14 ] there is provided the apparatus of example 10, further comprising:
a standard spoken text data acquisition module, configured to screen the spoken text data according to a preset screening rule to obtain standard spoken text data.
The speech recognition execution module is specifically configured to use the standard spoken text data for voice information recognition.
According to one or more embodiments of the present disclosure, [ example 15 ] there is provided the apparatus of example 14, further comprising:
the preset screening rules comprise sentence length screening rules, unregistered word screening rules and/or sentence frequency screening rules.
According to one or more embodiments of the present disclosure, [ example 16 ] there is provided the apparatus of example 11, further comprising:
the speech recognition execution module is specifically used for acquiring a spoken speech recognition model according to the spoken text data; and/or acquiring a universal speech recognition model according to the spoken text material and the written text material.
According to one or more embodiments of the present disclosure, [ example 17 ] there is provided an electronic device comprising a memory, a processing means, and a computer program stored on the memory and executable on the processing means, the processing means implementing the method of recognizing speech information as in any of examples 1-8 when executing the program.
According to one or more embodiments of the present disclosure, [ example 18 ] there is provided a storage medium containing computer-executable instructions for performing the method of recognition of speech information as in any one of examples 1-8 when executed by a computer processor.
The foregoing description is merely an illustration of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combinations of features described above; it also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions in which the above features are replaced with (but not limited to) features having similar functions disclosed in this disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (11)

1. A method for recognizing speech information, comprising:
acquiring at least one text corpus pair, wherein the text corpus pair comprises a mapping relation between a written text and a spoken text;
determining the association relation between the written text and the spoken text;
acquiring written text data, and acquiring spoken text data corresponding to the written text data according to the written text data and the association relation;
and using the spoken text data for voice information recognition.
2. The method of claim 1, wherein determining the association relation between the written text and the spoken text comprises: acquiring a text translation model according to the at least one text corpus pair, wherein the text translation model represents the association relation between the written text and the spoken text;
the obtaining of the spoken text data corresponding to the written text data according to the written text data and the association relationship includes: and inputting the written text data into the text translation model to obtain corresponding spoken text data.
3. The method of claim 1, wherein using the spoken text data for voice information recognition comprises: acquiring a speech recognition model according to the spoken text data, and recognizing the acquired voice information through the speech recognition model.
4. The method of claim 1, wherein the spoken text comprises repetition text, inversion text, modal-particle text, interrogative-conversion text, retroflex text, and/or incoherent text.
5. The method according to claim 2, wherein said obtaining a text translation model according to said at least one text corpus pair comprises:
constructing a neural network model based on the Transformer architecture, and performing text translation training on the Transformer-based neural network model through the at least one text corpus pair to obtain a trained text translation model.
6. The method of claim 2, further comprising, after obtaining the corresponding spoken text data:
screening the spoken text data according to a preset screening rule to obtain standard spoken text data;
wherein using the spoken text data for voice information recognition comprises:
using the standard spoken text data for voice information recognition.
7. The method of claim 6, wherein the preset filtering rules comprise sentence length filtering, unregistered word filtering and/or sentence frequency filtering.
8. The method of claim 3, wherein acquiring a speech recognition model according to the spoken text data comprises:
acquiring a spoken speech recognition model according to the spoken text data;
and/or acquiring a universal speech recognition model according to the spoken text data and the written text data.
9. An apparatus for recognizing speech information, comprising:
a text corpus pair acquisition module, configured to acquire at least one text corpus pair, wherein the text corpus pair comprises a mapping relation between a written text and a spoken text;
an association relation acquisition module, configured to determine the association relation between the written text and the spoken text;
a spoken text data acquisition module, configured to acquire written text data and to acquire spoken text data corresponding to the written text data according to the written text data and the association relation;
and a speech recognition execution module, configured to use the spoken text data for voice information recognition.
10. An electronic device comprising a memory, processing means and a computer program stored on the memory and executable on the processing means, characterized in that the processing means, when executing the program, implements a method of recognition of speech information according to any one of claims 1 to 8.
11. A storage medium containing computer-executable instructions for performing the method of recognition of speech information according to any one of claims 1-8 when executed by a computer processor.
CN202011455674.2A (filed 2020-12-10, priority 2020-12-10): Voice information recognition method and device, electronic equipment and storage medium. Status: Pending. Publication: CN112509578A.

Priority Applications (1)

CN202011455674.2A (priority and filing date 2020-12-10): Voice information recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

CN202011455674.2A (priority and filing date 2020-12-10): Voice information recognition method and device, electronic equipment and storage medium

Publications (1)

CN112509578A (published 2021-03-16)

Family

ID=74973451

Family Applications (1)

CN202011455674.2A, status Pending, publication CN112509578A: Voice information recognition method and device, electronic equipment and storage medium

Country Status (1)

CN: CN112509578A

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120290299A1 (en) * 2011-05-13 2012-11-15 International Business Machines Corporation Translating Between Spoken and Written Language
US20140149119A1 (en) * 2012-11-28 2014-05-29 Google Inc. Speech transcription including written text
CN110287461A (en) * 2019-05-24 2019-09-27 北京百度网讯科技有限公司 Text conversion method, device and storage medium
US20200013391A1 (en) * 2019-06-18 2020-01-09 Lg Electronics Inc. Acoustic information based language modeling system and method
CN110853621A (en) * 2019-10-09 2020-02-28 科大讯飞股份有限公司 Voice smoothing method and device, electronic equipment and computer storage medium
CN111681642A (en) * 2020-06-03 2020-09-18 北京字节跳动网络技术有限公司 Speech recognition evaluation method, device, storage medium and equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination