CN111128183B - Speech recognition method, apparatus and medium

Info

Publication number: CN111128183B
Application number: CN201911319959.0A
Authority: CN (China)
Prior art keywords: keyword, recognized, determining, voice data, node
Legal status: Active
Other versions: CN111128183A
Inventors: 陈小敏, 张晶晶, 陈伟, 赵超, 王小川
Assignee: Beijing Sogou Technology Development Co Ltd
Related application: PCT/CN2020/112594 (WO2021120690A1)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems

Abstract

The embodiment of the invention provides a speech recognition method, a speech recognition apparatus, and a device for speech recognition. The method specifically includes the following steps: receiving voice data to be recognized, and determining a first keyword related to the voice data to be recognized; determining a second keyword related to the first keyword according to a knowledge graph; and decoding the voice data to be recognized, adjusting the score of a decoding path corresponding to the voice data to be recognized according to the first keyword and the second keyword, and determining a voice recognition result corresponding to the voice data to be recognized according to the adjusted score of the decoding path. The embodiment of the invention can improve the speech recognition accuracy for keywords related to the application scenario.

Description

Speech recognition method, apparatus and medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a speech recognition method and apparatus, and a machine-readable medium.
Background
Speech recognition technology converts speech into corresponding text or codes, and is widely applied in fields such as smart home, real-time speech transcription, and machine simultaneous interpretation. Machine simultaneous interpretation is constrained by both speech recognition technology and machine translation technology, and the quality of the machine translation depends on the quality of the recognized text; therefore, to improve the accuracy of machine simultaneous interpretation, the quality of the speech recognition system must be improved. In some machine simultaneous interpretation scenarios, the recognition and translation of names of people, places, products, or proper nouns is a frequent problem, and these words often play an important role in the on-site effect.
Current speech recognition models usually adopt a general acoustic model and language model, which preferentially recognize common words and words with a high probability of occurrence in the corpora. For some specific application scenarios, the model is often customized for the scenario in order to improve recognition accuracy. Training a customized model requires obtaining a large amount of relevant corpora for the application scenario in advance; however, for scenarios such as conferences, the specific content of the participants' speeches cannot be obtained in advance, so customization is impossible and the accuracy of speech recognition in such scenarios is low.
Disclosure of Invention
In view of the above problems, embodiments of the present invention provide a speech recognition method, a speech recognition apparatus, and an apparatus for speech recognition that overcome, or at least partially solve, the above problems and can improve the speech recognition accuracy for keywords related to the application scenario.
In order to solve the above problems, the present invention discloses a speech recognition method, comprising:
receiving voice data to be recognized, and determining a first keyword related to the voice data to be recognized;
determining a second keyword related to the first keyword according to a knowledge graph;
and decoding the voice data to be recognized, adjusting the score of a decoding path corresponding to the voice data to be recognized according to the first keyword and the second keyword, and determining a voice recognition result corresponding to the voice data to be recognized according to the adjusted score of the decoding path.
In another aspect, an embodiment of the present invention discloses a speech recognition apparatus, including:
the receiving module is used for receiving voice data to be recognized;
the first keyword determining module is used for determining a first keyword related to the voice data to be recognized;
the second keyword determining module is used for determining a second keyword related to the first keyword according to the knowledge graph;
and the decoding processing module is used for decoding the voice data to be recognized, adjusting the score of a decoding path corresponding to the voice data to be recognized according to the first keyword and the second keyword, and determining a voice recognition result corresponding to the voice data to be recognized according to the adjusted score of the decoding path.
In yet another aspect, an embodiment of the present invention discloses an apparatus for speech recognition, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
receiving voice data to be recognized, and determining a first keyword related to the voice data to be recognized;
determining a second keyword related to the first keyword according to a knowledge graph;
and decoding the voice data to be recognized, adjusting the score of a decoding path corresponding to the voice data to be recognized according to the first keyword and the second keyword, and determining a voice recognition result corresponding to the voice data to be recognized according to the adjusted score of the decoding path.
Embodiments of the invention also disclose one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the aforementioned methods.
The embodiment of the invention has the following advantages:
In the speech recognition process, the embodiment of the invention determines the first keyword related to the voice data to be recognized. The first keyword may reflect characteristics of the application scenario; for example, it may be the speaker of a conference, the topic of the conference, or the like.
Further, the embodiment of the present invention determines the second keyword related to the first keyword according to the knowledge graph, so keyword expansion can be performed on the basis of the first keyword; for example, expanding the entity or concept corresponding to the conference speaker or the conference topic increases the coverage of keywords in the speech recognition process. Since keywords such as the first keyword and the second keyword reflect characteristics of the application scenario, adjusting the scores of decoding paths according to these keywords raises the scores of the decoding paths on which they lie, which in turn improves the speech recognition accuracy for those keywords.
Drawings
FIG. 1 is a flow diagram illustrating a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart of the steps of a first embodiment of a speech recognition method of the present invention;
FIG. 3 is a schematic of a knowledge-graph of an embodiment of the present invention;
FIG. 4 is a flowchart illustrating steps of a second embodiment of a speech recognition method according to the present invention;
FIG. 5 is a flowchart illustrating the steps of a third embodiment of a speech recognition method according to the present invention;
FIG. 6 is a block diagram showing a configuration of a speech recognition apparatus according to an embodiment of the present invention;
FIG. 7 is a block diagram of an apparatus 900 for speech recognition according to an embodiment of the present invention; and
FIG. 8 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The embodiment of the invention can be applied to a voice recognition scene. Speech recognition scenarios for converting speech to text may include: a voice input scenario, an intelligent chat scenario, a voice translation scenario, etc.
In a speech recognition scenario, an acoustic model adopts a deep neural network model to model a mapping relation between an acoustic pronunciation and a basic acoustic unit (generally a phoneme); phonemes are the smallest units of speech that are divided according to the natural properties of the speech. The acoustic model can receive input speech features and output a phoneme sequence corresponding to the speech features.
Referring to fig. 1, which shows a schematic diagram of a flow of a speech recognition method according to an embodiment of the present invention, a model used in the speech recognition method may include: acoustic models, language models, and decoders.
The determining process of the acoustic model may include: and performing feature extraction on voice corpora in the voice database, and performing acoustic model training according to the extracted voice features.
The determining process of the language model may include: and training a language model according to the text corpora in the text database.
The decoder is used for finding the best decoding path under the condition of a given phoneme sequence, and then a speech recognition result can be obtained.
The speech recognition process shown in fig. 1 may include: performing feature extraction on the input speech to obtain speech features, and inputting the speech features into a decoder. The decoder first determines a phoneme sequence corresponding to the speech features using the acoustic model; it then decodes the phoneme sequence according to the language model to obtain a speech recognition result, and outputs the text corresponding to that result.
The acoustic model may include a neural network model and a hidden Markov model, wherein the neural network model may provide acoustic modeling units to the hidden Markov model; the granularity of the acoustic modeling units may include words, syllables, phonemes, or states, etc. The hidden Markov model can determine the phoneme sequence according to the acoustic modeling units provided by the neural network model. A state is a mathematical characterization of a stage of a Markov process.
During speech recognition, the decoder searches for the optimal decoding path, given the voice data to be recognized, in a search space composed of knowledge sources such as the acoustic model, the dictionary, and the language model, and obtains the speech recognition result from the word sequence corresponding to the optimal decoding path.
During speech recognition, problems with homophones, variant written forms, or similar-sounding words often occur. For example, the syllables "luozhenyu" can correspond to different written forms of the name "Luo Zhenyu"; the syllables "luojisiwei" can correspond to the common phrase "logical thinking" or to the program name "Luoji Siwei"; the syllables "tongcheng" can correspond to "same city" or to the city name "Tongcheng"; and the syllables "dedao" can correspond to the common word "get" or to the app name "Dedao".
Existing decoders usually adopt a general acoustic model and language model, which preferentially recognize common words and words with a high probability of occurrence in the corpora; however, such words may not be appropriate for a specific application scenario. Training a customized model requires obtaining a large amount of relevant corpora for the application scenario in advance, but for scenarios such as conferences the specific content of the participants' speeches cannot be obtained in advance, so customization is impossible and the accuracy of speech recognition in such scenarios is low.
To address the technical problem of low speech recognition accuracy in specific application scenarios, an embodiment of the present invention provides a speech recognition scheme, which specifically includes: receiving voice data to be recognized, and determining a first keyword related to the voice data to be recognized; determining a second keyword related to the first keyword according to the knowledge graph; and decoding the voice data to be recognized, adjusting the score of a decoding path corresponding to the voice data to be recognized according to the first keyword and the second keyword, and determining a voice recognition result corresponding to the voice data to be recognized according to the adjusted score of the decoding path.
The embodiment of the invention determines the first keyword related to the voice data to be recognized, determines the second keyword related to the first keyword according to the knowledge graph, and can expand the keywords on the basis of the first keyword.
A knowledge graph is a knowledge base, also called a semantic network, i.e., a knowledge base with a directed graph structure. Nodes of the graph represent entities or concepts, and edges of the graph represent the various semantic relationships between entities/concepts; different nodes in the graph are connected by these semantic relationships.
An entity may be described by a number of attributes; for example, a person has attributes such as birthday, height, and spouse, and a movie entity has attributes such as director, actors, country of production, and release date.
Associations between different entities can be established through entity attributes, for example (a minimal data-structure sketch follows these examples):
Liu Dehua (entity) -- wife (attribute) --> Zhu Liqian (another entity);
Liu Dehua -- film works --> Infernal Affairs;
Infernal Affairs -- country/region of production --> Hong Kong, China;
person (entity) -- birthplace (attribute) --> place name (another entity);
person (entity) -- occupation (attribute) --> occupation name (another entity);
person (entity) -- representative work (attribute) --> app name (another entity).
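As a minimal illustration of this structure, the association examples above can be represented as a directed, attribute-labeled graph. The following Python sketch is illustrative only; the class design and method names are assumptions, not part of the patent:

```python
# Minimal sketch of a knowledge graph: a directed graph whose edges carry
# attribute (relation) labels. Entity names are taken from the examples above.
from collections import defaultdict

class KnowledgeGraph:
    def __init__(self):
        # head entity -> list of (attribute/relation, tail entity)
        self.edges = defaultdict(list)

    def add(self, head, relation, tail):
        self.edges[head].append((relation, tail))

    def neighbors(self, entity):
        """Entities directly connected to `entity` (one hop away)."""
        return [tail for _, tail in self.edges[entity]]

kg = KnowledgeGraph()
kg.add("Liu Dehua", "wife", "Zhu Liqian")
kg.add("Liu Dehua", "film works", "Infernal Affairs")
kg.add("Infernal Affairs", "country/region of production", "Hong Kong, China")

print(kg.neighbors("Liu Dehua"))  # ['Zhu Liqian', 'Infernal Affairs']
```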
In the speech recognition process, the embodiment of the invention determines the first keyword related to the voice data to be recognized. The first keyword may reflect characteristics of the application scenario; for example, it may be the speaker of a conference, the topic of the conference, and the like. Further, the embodiment determines the second keyword related to the first keyword according to the knowledge graph, and keyword expansion can be performed on the basis of the first keyword; for example, expanding the entity or concept corresponding to the conference speaker or the conference topic increases the coverage of keywords in the speech recognition process. Since keywords such as the first keyword and the second keyword reflect characteristics of the application scenario, adjusting decoding-path scores according to these keywords raises the scores of the decoding paths on which they lie, and thereby improves the speech recognition accuracy for those keywords.
For example, suppose the speech recognition scenario is a conference scenario with machine simultaneous interpretation. The first keyword determined in the embodiment of the present invention may include the conference speaker "Luo Zhenyu"; second keywords related to "Luo Zhenyu" can then be obtained according to the knowledge graph, where the second keywords are directly or indirectly connected to the first keyword "Luo Zhenyu" in the knowledge graph.
For example, the first keyword "Luo Zhenyu" is connected with the following second keywords through entity attributes:
birthplace: Tongcheng, Anhui;
occupation: founder of Luoji Siwei;
representative work: the Dedao app.
Having obtained the first keyword and the second keywords, if the voice data to be recognized input by the conference speaker includes the acoustic features corresponding to "luojisiwei", the embodiment of the invention can increase the score of the decoding path corresponding to "Luoji Siwei"; if it includes the acoustic features corresponding to "dedao", the embodiment can increase the score of the decoding path corresponding to "Dedao"; and if it includes the acoustic features corresponding to "tongcheng" (as in "the citizens of Tongcheng"), the embodiment can increase the score of the decoding path corresponding to the city name "Tongcheng". In this way, the embodiment of the invention can improve the speech recognition accuracy for keywords relevant to the application scenario.
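To make the score adjustment concrete, the sketch below rescores an n-best list of candidate decoding paths. This is a simplified stand-in, assuming the adjustment is applied to complete hypotheses rather than inside the decoder as the embodiment describes; the boost value is an arbitrary assumption:

```python
# Hedged sketch: raise the scores of candidate decoding paths that contain
# scene keywords (first and second keywords), then re-rank. The additive
# boost of 2.0 is an assumed value, not taken from the patent.
def rescore_nbest(nbest, keywords, boost=2.0):
    """nbest: list of (text, score) pairs; returns the list re-sorted."""
    rescored = []
    for text, score in nbest:
        hits = sum(1 for kw in keywords if kw in text)
        rescored.append((text, score + boost * hits))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

keywords = {"Luoji Siwei", "Dedao", "Tongcheng"}
nbest = [("logical thinking is the topic", 10.2),
         ("Luoji Siwei is the topic", 10.0)]
print(rescore_nbest(nbest, keywords)[0][0])  # "Luoji Siwei is the topic"
```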
The voice recognition method provided by the embodiment of the invention can be applied to the application environments of the client and the server, the client and the server are positioned in a wired or wireless network, and the client and the server perform data interaction through the wired or wireless network.
Alternatively, the client may run on the terminal, for example, the client may be an APP (Application program) running on the terminal, such as a voice transcription APP, or a voice translation APP, or an intelligent interaction APP.
Taking a voice transcription APP as an example, the client can collect voice data to be recognized and send the voice data to be recognized to the server, and the server can process the voice data to be recognized and return a voice recognition result to the client by using the scheme of the embodiment of the invention.
Taking a voice translation APP as an example, a client can acquire voice data to be recognized and send the voice data to be recognized to a server, and the server can process the voice data to be recognized and perform machine translation on an obtained voice recognition result by using the scheme of the embodiment of the invention to obtain a machine translation result, and return the machine translation result to the client.
Optionally, the terminal may include: conference terminals, smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, car-mounted computers, desktop computers, set-top boxes, smart televisions, wearable devices, smart speakers, and the like. It is understood that the embodiment of the present invention does not limit the specific terminal.
Method embodiment 1
Referring to fig. 2, a flowchart illustrating steps of a first embodiment of a speech recognition method according to the present invention is shown, which may specifically include the following steps:
step 201, receiving voice data to be recognized, and determining a first keyword related to the voice data to be recognized;
step 202, determining a second keyword related to the first keyword according to a knowledge graph;
step 203, decoding the voice data to be recognized, adjusting the score of the decoding path corresponding to the subsequent voice data to be recognized according to the first keyword and the second keyword, and determining the voice recognition result corresponding to the voice data to be recognized according to the adjusted score of the decoding path.
At least one step of the method shown in fig. 2 may be executed by the client and/or the server; it is understood that the embodiment of the present invention does not limit the specific execution subject of each step.
In step 201, the manner of determining the first keyword related to the voice data to be recognized may specifically include:
Manner 1: acquiring text material related to the voice data to be recognized, and extracting the first keyword from the text material; or
Manner 2: performing face recognition on the image corresponding to the speaker to obtain the first keyword corresponding to the speaker.
In manner 1, the first keyword may be extracted from text material played synchronously at the conference site.
Textual material associated with the speech data to be recognized may include, but is not limited to, any one or more of: promotional material related to the speaking site, presentation material displayed at the speaking site.
Wherein promotional material related to the speaking scene is available prior to or during the meeting.
The text material can be in various forms, such as a picture + text, a video + text and the like, and the text format can be a PPT format, a WORD format and the like.
Alternatively, the text material may be obtained by performing OCR (Optical Character Recognition) on an image corresponding to the presentation.
The text material may be obtained by first obtaining an image related to the voice data through an image capturing device, such as a camera, and then recognizing a text in the image through an OCR technology, so as to obtain the text material related to the voice data. Of course, the image corresponding to the presentation can also be obtained in a screen capture mode.
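A minimal sketch of this step, assuming the Tesseract engine through the pytesseract wrapper; any OCR engine would serve, and the language code and file name are illustrative:

```python
# Hedged sketch: obtain text material from a captured presentation image via
# OCR. Assumes Tesseract is installed; pytesseract is one possible wrapper.
from PIL import Image
import pytesseract

def text_material_from_slide(image_path, lang="chi_sim+eng"):
    """Recognize the text in one slide image captured by camera or screen capture."""
    image = Image.open(image_path)
    return pytesseract.image_to_string(image, lang=lang)

material = text_material_from_slide("slide_capture.png")  # illustrative path
```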
In step 201, a category of the first keyword may be set, for example, the category may be an entity word and/or a term. Of course, according to the characteristics of the speech data to be recognized in the application scenario, the first keyword may also include some other types of words, which is not limited in the embodiment of the present invention.
Optionally, the extracting the first keyword from the text material may specifically include: and carrying out named entity recognition on the text material to obtain entity words in the text material.
A named entity refers to an entity in the text that has a particular meaning, such as a person's name, place name, organization name, proper noun, and the like. The named entity identification method can comprise the following steps: a rule and dictionary based approach, a statistical based approach, or a neural network based approach, etc.
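Of the approaches listed above, the rule-and-dictionary method is the simplest to illustrate; the dictionary contents below are assumed examples, and a statistical or neural NER model could be substituted:

```python
# Hedged sketch of dictionary-based named entity recognition over the text
# material; the entity dictionary content is illustrative.
ENTITY_DICT = {"Luo Zhenyu": "PERSON", "Tongcheng": "PLACE", "Dedao": "PRODUCT"}

def extract_entity_words(text):
    """Return (entity word, entity type) pairs found in the text material."""
    return [(word, etype) for word, etype in ENTITY_DICT.items() if word in text]

first_keywords = [w for w, _ in extract_entity_words(
    "Luo Zhenyu will introduce the Dedao app")]  # -> ['Luo Zhenyu', 'Dedao']
```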
In manner 2, an image corresponding to the speaker may be collected, and face recognition may be performed on that image to obtain identity information corresponding to the speaker.
For example, a face image of a known user may be stored in a database for face recognition, so that identity information corresponding to a speaker may be recognized. The identity information may include: name, etc.
For another example, image search may be performed according to an image corresponding to a speaker to obtain user information corresponding to the image, and the first keyword may be determined according to the user information. Optionally, a mapping relationship between the image and the user information may be stored in the database of the image search, and the image corresponding to the presenter may be matched with the image in the database of the image search, so as to obtain the user information corresponding to the presenter.
In step 202, in the knowledge graph, an entity may be used to express a node and a relationship may be used to express an edge. An entity refers to a thing in the real world, such as a person, place name, concept, medicine, or company; a relationship expresses some kind of connection between different entities, such as: a person "lives in" Beijing, Zhang San and Li Si are "friends", logistic regression is "prerequisite knowledge" for deep learning, and so on.
By using interactive machine learning techniques, the knowledge graph can learn from interactions such as reasoning, error correction, and annotation, continuously accumulating knowledge logic and models, which improves the accuracy, breadth, and coverage of the knowledge graph.
Optionally, inference over the knowledge graph may include: given a relationship, judging whether that relationship holds between any two entities; mining new relationships; or mining entities related to a given entity. For example, inference can be used to mine entities related to "expert system", such as "supervised learning", "reinforcement learning", "Bayesian classifier", "covariance matrix", and the like.
Alternatively, the knowledge graph may be error-corrected using rules, for example based on the reciprocal properties of relationships. If relationship A of entity 1 points to entity 2 and relationship A' of entity 2 points to entity 3, then when entity 1 does not match entity 3 it can be considered that there is an error in entity 1 or entity 3. Here, relationship A and relationship A' are inverse relationships, such as A being "wife" and A' being "husband", or A being "parent company" and A' being "subsidiary". A minimal check of this rule is sketched below.
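The following sketch checks candidate triples against an assumed table of inverse relations; both the table contents and the triple format are illustrative:

```python
# Hedged sketch of rule-based error correction via reciprocal relations:
# if entity1 --A--> entity2 and entity2 --A'--> entity3, where A' is the
# inverse of A, then entity3 should match entity1; otherwise flag the triple.
INVERSE = {"wife": "husband", "parent company": "subsidiary"}

def reciprocal_suspects(triples):
    """triples: iterable of (head, relation, tail); returns suspect triples."""
    triples = set(triples)
    suspects = []
    for head, rel, tail in triples:
        inv = INVERSE.get(rel)
        if inv is None:
            continue
        back = {t for h, r, t in triples if h == tail and r == inv}
        if back and head not in back:   # inverse edge exists but disagrees
            suspects.append((head, rel, tail))
    return suspects
```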
Referring to FIG. 3, a schematic of a knowledge graph of an embodiment of the present invention is shown, in which "Luo Zhenyu" is connected through entity attributes to the following entities:
birthplace: Tongcheng, Anhui;
occupation: founder of Luoji Siwei;
representative work: the Dedao app;
friend: XXX;
contact information: mobile phone number;
programs participated in: director of a variety show program, etc.
It is understood that the entity attributes in fig. 3 serve only as an alternative embodiment; in fact, the embodiment of the present invention does not limit the specific entity attributes. In addition, in fig. 3, entities such as "Tongcheng" and "Luoji Siwei" may be connected with entities other than "Luo Zhenyu"; the embodiments of the present invention do not limit the specific nodes in the knowledge graph or the specific connections between nodes.
In an optional embodiment of the present invention, the determining the second keyword related to the first keyword may specifically include: and determining a second keyword related to the first keyword according to the position of the first keyword in the knowledge graph. The location of the first keyword in the knowledge-graph may refer to a location in the knowledge-graph where an entity or concept matching the first keyword is located.
According to an embodiment, if the first keyword corresponds to a first node in the knowledge graph and the first node is a starting node, the second keyword is obtained from all nodes of the knowledge graph. A starting node may point to non-starting nodes and is not pointed to by other nodes. The starting node may be the root node of a tree-shaped knowledge graph, or the central node of a non-tree-shaped knowledge graph; in fig. 3, for example, the node "Luo Zhenyu" is at the center of the graph, with the arrows pointing outward.
According to another embodiment, if the first keyword corresponds to a second node in the knowledge graph and the second node is a non-starting node, the second keyword is obtained from the second node and the nodes subordinate to it. The nodes subordinate to the second node may include: a third node pointed to by the second node, a fourth node pointed to by the third node, a fifth node pointed to by the fourth node, and so on.
In practical application, the keyword corresponding to the third node pointed to by the second node may first be taken as a second keyword. Whether the keyword corresponding to the fourth node is taken as a second keyword can be determined by its relevance to the first keyword; for example, if the relevance exceeds a relevance threshold, the fourth node's keyword is taken as a second keyword. Similarly, whether the keyword corresponding to the fifth node is taken as a second keyword may be determined by its relevance to the first keyword. Both cases are sketched below.
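Reusing the KnowledgeGraph sketch from earlier, both cases might be implemented as follows; the helpers is_start_node and all_nodes, and the relevance function, are assumptions:

```python
# Hedged sketch of second-keyword expansion. Case 1: the matched node is a
# starting node, so every node in the graph is taken. Case 2: walk the nodes
# subordinate to the matched node; direct children are always kept, deeper
# nodes only if their relevance to the first keyword passes a threshold.
def expand_keywords(kg, first_kw, relevance, threshold=0.5):
    if kg.is_start_node(first_kw):                 # case 1: starting node
        return [n for n in kg.all_nodes() if n != first_kw]
    second_kws, visited = [], {first_kw}
    frontier, depth = [first_kw], 0
    while frontier:                                # case 2: non-starting node
        depth += 1
        next_frontier = []
        for node in frontier:
            for child in kg.neighbors(node):
                if child in visited:
                    continue
                visited.add(child)
                if depth == 1 or relevance(child, first_kw) > threshold:
                    second_kws.append(child)
                    next_frontier.append(child)
        frontier = next_frontier
    return second_kws
```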
In step 203, the voice data to be recognized may be decoded by a decoder. Optionally, the decoder may determine a phoneme sequence corresponding to the speech feature of the speech data to be recognized by using the acoustic model; and then, performing speech decoding on the phoneme sequence according to the language model, wherein the speech decoding is used for finding the optimal decoding path under the condition of giving the phoneme sequence, and further obtaining a speech recognition result.
In the embodiment of the present invention, the language model and the acoustic model in the decoding process may adopt a general language model and an acoustic model to obtain the decoding path corresponding to the voice data to be recognized.
Further, the score of the decoding path corresponding to the subsequent voice data to be recognized may be adjusted according to the first keyword and the second keyword. Specifically, the score of the decoding path passing through the first keyword and the second keyword may be increased.
According to an embodiment, the increased score corresponding to the decoding path where the first keyword or the second keyword is located may be determined according to information such as the number of occurrences or the occurrence ratio of the first keyword or the second keyword in the text material. For example, the more occurrences, the more the score is increased.
According to another embodiment, the increased score corresponding to the decoding path of the second keyword can be determined according to the correlation between the first keyword and the second keyword. For example, the higher the correlation, the more the score is increased.
Through the adjustment of the score of the decoding path, the score of the decoding path where the first keyword or the second keyword is located is improved, the selection probability of the decoding path where the first keyword or the second keyword is located can be increased, some keywords in the voice recognition result obtained through decoding are matched with the first keyword or the second keyword, and the accuracy of voice recognition is improved.
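As a sketch of the two weighting rules above; the base value and the logarithmic scaling are assumptions, not prescribed by the embodiment:

```python
# Hedged sketch: the boost for a keyword grows with its occurrence count in
# the text material; the boost for a second keyword also grows with its
# relevance to the first keyword.
import math

def keyword_boost(occurrences, rel_to_first=1.0, base=1.0):
    """More occurrences and higher relevance yield a larger additive boost."""
    return base * (1.0 + math.log1p(occurrences)) * rel_to_first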
In summary, in the speech recognition method of this embodiment, text material related to the voice data to be recognized is obtained during the speech recognition process, and the first keyword is extracted from the text material. Since the text material is related to the voice data to be recognized, the first keyword extracted from it may reflect characteristics of the application scenario; for example, it may be the speaker of the conference, the conference topic, and the like.
Further, the embodiment of the present invention determines the second keyword related to the first keyword according to the knowledge graph, and the keyword expansion can be performed on the basis of the first keyword. For example, the entity or concept corresponding to the speaker of the conference and the entity or concept corresponding to the topic of the conference are extended by keywords, so that the coverage of the keywords in the voice recognition process can be increased. Since the keywords such as the first keyword and the second keyword can reflect the characteristics of the application scene, the score of the decoding path is adjusted according to the keywords such as the first keyword and the second keyword, the score of the decoding path where the keywords such as the first keyword and the second keyword are located can be improved, and the speech recognition accuracy rate corresponding to the keywords can be further improved.
Method embodiment two
Referring to fig. 4, a flowchart illustrating steps of a second embodiment of the speech recognition method of the present invention is shown, which may specifically include the following steps:
step 401, receiving voice data to be recognized, and determining a first keyword related to the voice data to be recognized;
step 402, determining a second keyword related to the first keyword according to a knowledge graph;
step 403, determining a third keyword related to the first keyword according to a search engine;
step 404, decoding the voice data to be recognized, adjusting the score of the decoding path corresponding to the subsequent voice data to be recognized according to the first keyword, the second keyword and the third keyword, and determining the voice recognition result corresponding to the voice data to be recognized according to the adjusted score of the decoding path.
Compared with the first method embodiment shown in fig. 2, the present embodiment additionally expands the keyword set through a relevance-based search starting from the first keyword, which can further increase the coverage of keywords in the speech recognition process.
According to one embodiment, a search may be performed in a database of a search engine based on the first keyword, and the third keyword may be extracted from the search result.
For example, if the search result is a web page, candidate keywords may be extracted from the title of the web page and filtered according to their relevance to the first keyword.
For another example, if the search result is a web page, the TF-IDF (term frequency-inverse document frequency) method may be used to extract candidate keywords from the body of the page, which are likewise filtered according to their relevance to the first keyword.
Optionally, a relevance between the first keyword and the third keyword exceeds a relevance threshold. The correlation threshold can be determined by one skilled in the art according to the actual application requirements, and the embodiment of the present invention does not limit the specific correlation threshold.
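A sketch of the TF-IDF variant, using scikit-learn's TfidfVectorizer; the default tokenization and the top-k cut-off are assumptions, and the relevance filter against the first keyword would follow:

```python
# Hedged sketch: extract candidate third keywords from result pages by TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_candidates(pages, top_k=10):
    """pages: list of page texts; returns the top_k terms of each page."""
    vec = TfidfVectorizer()
    weights = vec.fit_transform(pages).toarray()
    terms = vec.get_feature_names_out()
    result = []
    for row in weights:
        top = row.argsort()[::-1][:top_k]
        result.append([terms[i] for i in top if row[i] > 0])
    return result
```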
According to another embodiment, the third keyword related to the first keyword may be obtained from a log of a search engine, where the third keyword satisfies the following condition: its search time is adjacent to, and later than, the search time of the first keyword. For example, if after searching for the first keyword "actor name A" a user searches for "actor name B" and "movie title" in sequence, then "actor name A" is considered related to "actor name B" and "movie title", so "actor name B" and "movie title" can be used as third keywords corresponding to "actor name A".
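A sketch of this log-based variant; the log format (per-session, time-ordered query lists) and the window size are assumptions:

```python
# Hedged sketch: queries issued shortly after the first keyword within the
# same session are treated as related third keywords.
def related_queries(log, first_kw, window=2):
    related = []
    for session in log:                # each session: queries in time order
        for i, query in enumerate(session):
            if query == first_kw:
                related.extend(session[i + 1 : i + 1 + window])
    return related

log = [["actor name A", "actor name B", "movie title", "weather"]]
print(related_queries(log, "actor name A"))  # ['actor name B', 'movie title']
```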
In step 404, the score of the decoding path on which the third keyword lies may be increased. This process is similar to increasing the score of the decoding path for the first or second keyword, so it is not repeated here; the two descriptions may be referred to mutually.
Method embodiment three
Referring to fig. 5, a flowchart illustrating steps of a third embodiment of a speech recognition method according to the present invention is shown, which may specifically include the following steps:
step 501, receiving voice data to be recognized, and determining a first keyword related to the voice data to be recognized;
step 502, determining a second keyword related to the first keyword according to a knowledge graph;
step 503, decoding the voice data to be recognized, adjusting the score of a decoding path corresponding to the subsequent voice data to be recognized according to the first keyword and the second keyword, and determining a voice recognition result corresponding to the voice data to be recognized according to the adjusted score of the decoding path;
with respect to the first embodiment of the method shown in fig. 2, the method of this embodiment may further include:
step 504, displaying the voice recognition result, and receiving correction information of the voice recognition result;
step 505, adjusting the score of the decoding path corresponding to the subsequent voice data to be recognized according to the correction information.
Compared with the first method embodiment shown in fig. 2, the embodiment of the present invention may also present the voice recognition result, so that the user generates the correction information for the voice recognition result. The user may be an on-site user, or an off-site user.
Further, according to the correction information, the embodiment of the present invention may further adjust the score of the decoding path corresponding to the subsequent voice data to be recognized, so as to further improve the voice recognition accuracy corresponding to the subsequent voice data to be recognized.
For example, suppose earlier voice data to be recognized includes entity words such as a person name, place name, or product name, and the corresponding speech recognition result A contains an error in an entity word. The user then generates correction information for the result, such as changing "person name A" in result A to "person name B", changing "place name A" to "place name B", or changing a non-entity word to an entity word. The word corresponding to the correction information may be taken as a fourth keyword, and the fourth keyword is used when adjusting the scores of decoding paths for subsequent voice data to be recognized; for example, the score of the decoding path on which the fourth keyword lies is increased.
In this way, real-time corrections of the speech recognition result can be received, the scores of decoding paths for subsequent voice data to be recognized can be adjusted according to the correction information, and the recognition of subsequent voice data can be further improved.
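One way the fourth keywords could be derived from a correction is to diff the displayed result against the corrected text; the whitespace tokenization below is an assumption (character-level alignment would suit Chinese better):

```python
# Hedged sketch: collect the words a user substituted or inserted when
# correcting the displayed recognition result; these become fourth keywords
# whose decoding paths are boosted for subsequent speech data.
import difflib

def fourth_keywords(shown, corrected):
    a, b = shown.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=a, b=b)
    kws = []
    for op, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if op in ("replace", "insert"):   # words the user put in
            kws.extend(b[j1:j2])
    return kws

print(fourth_keywords("person name A spoke", "person name B spoke"))  # ['B']
```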
It should be noted that, for simplicity of description, the method embodiments are described as a series of combinations of actions, but those skilled in the art will understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the present invention.
Device embodiment
Referring to fig. 6, a block diagram of a speech recognition apparatus according to an embodiment of the present invention is shown, which may specifically include:
a receiving module 601, configured to receive voice data to be recognized;
a first keyword determining module 602, configured to determine a first keyword related to the to-be-recognized voice data;
a second keyword determining module 603, configured to determine, according to a knowledge graph, a second keyword related to the first keyword;
and a decoding processing module 604, configured to decode the voice data to be recognized, adjust the score of a decoding path corresponding to the voice data to be recognized according to the first keyword and the second keyword, and determine a voice recognition result corresponding to the voice data to be recognized according to the adjusted score of the decoding path.
Optionally, the first keyword determining module 602 may include:
a text extraction module, configured to acquire text material related to the voice data to be recognized and extract the first keyword from the text material; or
an image recognition module, configured to perform face recognition on an image corresponding to the speaker to obtain the first keyword corresponding to the speaker.
Optionally, the text material is obtained by performing optical character recognition on an image corresponding to the presentation.
Optionally, the second keyword determining module 603 may include:
a map processing module, configured to determine the second keyword related to the first keyword according to the position of the first keyword in the knowledge graph.
Optionally, the map processing module may include:
a first map processing module, configured to obtain the second keyword according to all nodes of the knowledge graph if the first keyword corresponds to a first node in the knowledge graph and the first node is a starting node; or
a second map processing module, configured to obtain the second keyword according to the second node and nodes subordinate to the second node if the first keyword corresponds to a second node in the knowledge graph and the second node is a non-starting node.
Optionally, the text extraction module may include:
an entity recognition module, configured to perform named entity recognition on the text material to obtain the entity words in the text material.
Optionally, the apparatus may further include:
a third keyword determining module, configured to determine a third keyword related to the first keyword according to a search engine;
the decoding processing module may then include:
a first adjusting module, configured to adjust the score of the decoding path corresponding to the voice data to be recognized according to the first keyword, the second keyword, and the third keyword.
Optionally, the apparatus may further include:
a display module, configured to display the voice recognition result and receive correction information for the voice recognition result;
a second adjusting module, configured to adjust the score of the decoding path corresponding to subsequent voice data to be recognized according to the correction information.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 7 is a block diagram illustrating an apparatus 900 for speech recognition according to an example embodiment. For example, the apparatus 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 7, the apparatus 900 may include one or more of the following components: processing component 902, memory 904, power component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and communication component 916.
The processing component 902 generally controls overall operation of the device 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing element 902 may include one or more processors 920 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 902 can include one or more modules that facilitate interaction between processing component 902 and other components. For example, the processing component 902 may include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operation at the device 900. Examples of such data include instructions for any application or method operating on device 900, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 904 may be implemented by any type or combination of volatile or non-volatile storage devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 906 provides power to the various components of the device 900. The power components 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 900.
The multimedia component 908 comprises a screen providing an output interface between the device 900 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide motion action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 900 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, audio component 910 includes a Microphone (MIC) configured to receive external audio signals when apparatus 900 is in an operating mode, such as a call mode, a record mode, and a voice recognition mode. The received audio signal may further be stored in the memory 904 or transmitted via the communication component 916. In some embodiments, audio component 910 further includes a speaker for outputting audio signals.
I/O interface 912 provides an interface between processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing various aspects of state assessment for the device 900. For example, the sensor component 914 may detect the open/closed state of the device 900 and the relative positioning of components (such as the display and keypad of the apparatus 900); it may also detect a change in position of the apparatus 900 or of a component of the apparatus 900, the presence or absence of user contact with the apparatus 900, the orientation or acceleration/deceleration of the apparatus 900, and a change in temperature of the apparatus 900. The sensor assembly 914 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate wired or wireless communication between the apparatus 900 and other devices. The apparatus 900 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast-associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 904 comprising instructions, executable by the processor 920 of the apparatus 900 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium having instructions therein, which when executed by a processor of a smart terminal, enable the smart terminal to perform a speech recognition method, the method comprising: receiving voice data to be recognized, and determining a first keyword related to the voice data to be recognized; determining a second keyword related to the first keyword according to a knowledge graph; and decoding the voice data to be recognized, adjusting the score of a decoding path corresponding to the voice data to be recognized according to the first keyword and the second keyword, and determining a voice recognition result corresponding to the voice data to be recognized according to the adjusted score of the decoding path.
Fig. 8 is a schematic structural diagram of a server in an embodiment of the present invention. The server 1900, which may vary widely in configuration or performance, may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) that store applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a sequence of instructions operating on a server. Further, a central processor 1922 may be arranged to communicate with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The embodiment of the invention discloses A1, a voice recognition method, comprising the following steps:
receiving voice data to be recognized, and determining a first keyword related to the voice data to be recognized;
determining a second keyword related to the first keyword according to a knowledge graph;
and decoding the voice data to be recognized, adjusting the score of a decoding path corresponding to the voice data to be recognized according to the first keyword and the second keyword, and determining a voice recognition result corresponding to the voice data to be recognized according to the adjusted score of the decoding path.
A2, according to the method in A1, the determining a first keyword related to the voice data to be recognized includes:
acquiring text materials related to the voice data to be recognized, and extracting first keywords from the text materials; or alternatively
And carrying out face recognition on the image corresponding to the speaker to obtain a first keyword corresponding to the speaker.
And A3, according to the method of A2, the text material is obtained by carrying out optical character recognition on the image corresponding to the presentation.
A4, according to the method of any one of A1 to A3, the determining a second keyword related to the first keyword includes:
and determining a second keyword related to the first keyword according to the position of the first keyword in the knowledge graph.
A5, according to the method in A4, the determining a second keyword related to the first keyword includes:
if the first keyword corresponds to a first node in the knowledge graph and the first node is an initial node, obtaining a second keyword according to all nodes of the knowledge graph; or
if the first keyword corresponds to a second node in the knowledge graph and the second node is a non-initial node, obtaining a second keyword according to the second node and nodes subordinate to the second node.
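The node-expansion rule of A5 can be illustrated with a small sketch over a knowledge graph stored as an adjacency mapping {node: [child, ...]}; this dictionary representation and the function names are assumptions for exposition only.

def descendants(graph, node):
    # Collect the node and everything reachable below it.
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph.get(n, []))
    return seen

def second_keywords(graph, root, first_keyword):
    # All nodes if the first keyword is the initial node, else its subtree.
    if first_keyword == root:
        nodes = set(graph) | {c for cs in graph.values() for c in cs}
    else:
        nodes = descendants(graph, first_keyword)
    return nodes - {first_keyword}

# Usage with a toy product graph whose initial node is "Sogou".
graph = {"Sogou": ["input method", "translator"],
         "translator": ["offline mode", "simultaneous interpretation"]}
print(second_keywords(graph, "Sogou", "translator"))
# -> {'offline mode', 'simultaneous interpretation'}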
A6, according to the method in A2, the extracting the first keyword from the text material includes:
performing named entity recognition on the text material to obtain entity words in the text material.
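For illustration of A6 only: the sketch below extracts entity words with spaCy, assuming the en_core_web_sm model has been downloaded; the disclosure does not tie named entity recognition to any particular toolkit.

import spacy

nlp = spacy.load("en_core_web_sm")

def entity_words(text_material: str) -> list[str]:
    # Return entity words (people, places, products, ...) found in the text material.
    return [ent.text for ent in nlp(text_material).ents]

print(entity_words("Sogou demonstrated simultaneous interpretation in Beijing."))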
A7, the method according to any of A1 to A3, further comprising:
determining a third keyword related to the first keyword according to a search engine;
in this case, the adjusting the score of the decoding path corresponding to the voice data to be recognized includes:
adjusting the score of the decoding path corresponding to the voice data to be recognized according to the first keyword, the second keyword and the third keyword.
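A7 does not specify how the search engine is queried, so the following sketch only illustrates the idea of mining candidate third keywords from result snippets returned for the first keyword; fetch_snippets is a hypothetical stand-in for whatever search interface is actually used.

from collections import Counter

def fetch_snippets(query: str) -> list[str]:
    # Hypothetical search call; replace with a real search engine client.
    raise NotImplementedError

def third_keywords(first_keyword: str, top_n: int = 5) -> list[str]:
    # Rank capitalized terms that co-occur with the first keyword in snippets.
    counts = Counter()
    for snippet in fetch_snippets(first_keyword):
        for token in snippet.split():
            word = token.strip(".,;:!?")
            if word.istitle() and word != first_keyword:
                counts[word] += 1
    return [word for word, _ in counts.most_common(top_n)]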
A8, the method of any one of A1 to A3, further comprising:
displaying the voice recognition result, and receiving correction information of the voice recognition result;
and adjusting the score of the decoding path corresponding to the subsequent voice data to be recognized according to the correction information.
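A8 describes a feedback loop; the minimal sketch below assumes corrections arrive as (wrong, right) pairs and strengthens the corrected word so that subsequent decoding paths containing it score higher. The additive update rule is an assumption, not the disclosed method.

hotword_bonus: dict[str, float] = {}

def apply_correction(wrong: str, right: str, step: float = 1.0) -> None:
    # Remember a user correction and strengthen the corrected word;
    # `wrong` is kept for symmetry, and a fuller version might also penalize it.
    hotword_bonus[right] = hotword_bonus.get(right, 0.0) + step

def path_adjustment(path_text: str) -> float:
    # Extra score granted to a subsequent decoding path.
    return sum(b for w, b in hotword_bonus.items() if w in path_text)

apply_correction(wrong="colonel", right="kernel")
print(path_adjustment("the Sogou kernel model"))  # -> 1.0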
The embodiment of the invention discloses B9, a voice recognition device, comprising:
the receiving module is used for receiving voice data to be recognized;
the first keyword determining module is used for determining a first keyword related to the voice data to be recognized;
the second keyword determining module is used for determining a second keyword related to the first keyword according to the knowledge graph;
and the decoding processing module is used for decoding the voice data to be recognized, adjusting the score of a decoding path corresponding to the voice data to be recognized according to the first keyword and the second keyword, and determining a voice recognition result corresponding to the voice data to be recognized according to the adjusted score of the decoding path.
B10, according to the device of B9, the first keyword determining module comprises:
the text extraction module is used for acquiring text material related to the voice data to be recognized and extracting the first keyword from the text material; or
the image recognition module is used for performing face recognition on an image corresponding to the speaker to obtain a first keyword corresponding to the speaker.
B11, according to the device of B10, the text material is obtained by performing optical character recognition on an image corresponding to a presentation.
B12, according to the device of any one of B9 to B11, the second keyword determining module comprises:
the graph processing module, which is used for determining a second keyword related to the first keyword according to the position of the first keyword in the knowledge graph.
B13, according to the device of B12, the graph processing module comprises:
the first graph processing module, which is used for obtaining a second keyword according to all nodes of the knowledge graph if the first keyword corresponds to a first node in the knowledge graph and the first node is an initial node; or
the second graph processing module, which is used for obtaining a second keyword according to the second node and nodes subordinate to the second node if the first keyword corresponds to a second node in the knowledge graph and the second node is a non-initial node.
B14, according to the device of B10, the text extraction module comprises:
the entity recognition module is used for performing named entity recognition on the text material to obtain entity words in the text material.
B15, the apparatus according to any of B9 to B11, further comprising:
the third keyword determining module is used for determining a third keyword related to the first keyword according to a search engine;
the decoding processing module comprises:
the first adjusting module is used for adjusting the score of the decoding path corresponding to the voice data to be recognized according to the first keyword, the second keyword and the third keyword.
B16, the apparatus according to any of B9 to B11, further comprising:
the display module is used for displaying the voice recognition result and receiving correction information of the voice recognition result;
and the second adjusting module is used for adjusting the score of the decoding path corresponding to the subsequent voice data to be recognized according to the correction information.
The embodiment of the invention discloses C17, a device for speech recognition, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
receiving voice data to be recognized, and determining a first keyword related to the voice data to be recognized;
determining a second keyword related to the first keyword according to a knowledge graph;
and decoding the voice data to be recognized, adjusting the score of a decoding path corresponding to the voice data to be recognized according to the first keyword and the second keyword, and determining a voice recognition result corresponding to the voice data to be recognized according to the adjusted score of the decoding path.
C18, according to the device of C17, the determining a first keyword related to the voice data to be recognized includes:
acquiring text material related to the voice data to be recognized, and extracting the first keyword from the text material; or
performing face recognition on an image corresponding to the speaker to obtain a first keyword corresponding to the speaker.
C19, according to the device of C18, the text material is obtained by performing optical character recognition on an image corresponding to a presentation.
C20, according to the apparatus in any one of C17 to C19, the determining a second keyword related to the first keyword includes:
determining a second keyword related to the first keyword according to the position of the first keyword in the knowledge graph.
C21, according to the device of C20, the determining a second keyword related to the first keyword includes:
if the first keyword corresponds to a first node in the knowledge graph and the first node is an initial node, obtaining a second keyword according to all nodes of the knowledge graph; or
if the first keyword corresponds to a second node in the knowledge graph and the second node is a non-initial node, obtaining a second keyword according to the second node and nodes subordinate to the second node.
C22, according to the device of C18, the extracting the first keyword from the text material includes:
performing named entity recognition on the text material to obtain entity words in the text material.
C23, the device according to any one of C17 to C19, wherein the device is further configured such that the one or more processors execute the one or more programs, the one or more programs including instructions for:
determining a third keyword related to the first keyword according to a search engine;
in this case, the adjusting the score of the decoding path corresponding to the voice data to be recognized includes:
adjusting the score of the decoding path corresponding to the voice data to be recognized according to the first keyword, the second keyword and the third keyword.
C24, the device according to any one of C17 to C19, wherein the device is further configured such that the one or more processors execute the one or more programs, the one or more programs including instructions for:
displaying the voice recognition result, and receiving correction information of the voice recognition result;
and adjusting the score of the decoding path corresponding to the subsequent voice data to be recognized according to the correction information.
The embodiment of the invention discloses D25, one or more machine-readable media having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform the method described in any one of A1 to A8.
The foregoing has described in detail the speech recognition method, apparatus and medium provided by the present invention. Specific examples have been applied herein to explain the principles and embodiments of the invention, and the description of these examples is intended only to help in understanding the method and its core ideas. Meanwhile, a person skilled in the art may, according to the ideas of the present invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (25)

1. A speech recognition method, applied to a machine simultaneous interpretation scenario, comprising the following steps:
receiving voice data to be recognized, and determining a first keyword related to the voice data to be recognized;
determining a second keyword related to the first keyword according to a knowledge graph; the node corresponding to the first keyword points to the node corresponding to the second keyword in the knowledge graph;
decoding the voice data to be recognized, adjusting the score of a decoding path corresponding to the voice data to be recognized according to the first keyword and the second keyword, and determining a voice recognition result corresponding to the voice data to be recognized according to the adjusted score of the decoding path; the adjustment is used for improving the score of the decoding path where the first keyword or the second keyword is located; determining an increased score corresponding to a decoding path where the first keyword or the second keyword is located according to the occurrence frequency or the occurrence ratio of the first keyword or the second keyword in the text material; or determining an increased score corresponding to the decoding path of the second keyword according to the correlation between the first keyword and the second keyword;
and displaying the voice recognition result.
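Purely to illustrate the increment rules recited above, and not as the claimed implementation: the sketch below derives a first keyword's score increment from its occurrence frequency in the text material, and a second keyword's increment from an assumed correlation weight; both scale factors are illustrative.

def first_keyword_bonus(keyword: str, text_material: str, scale: float = 0.5) -> float:
    # Increment grows with the keyword's occurrence frequency in the text material.
    return scale * text_material.count(keyword)

def second_keyword_bonus(correlation: float, scale: float = 1.0) -> float:
    # Increment grows with the correlation between first and second keywords (0..1).
    return scale * correlation

print(first_keyword_bonus("kernel", "kernel tricks and kernel methods"))  # -> 1.0
print(second_keyword_bonus(0.8))  # -> 0.8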
2. The method of claim 1, wherein determining the first keyword related to the speech data to be recognized comprises:
acquiring text material related to the voice data to be recognized, and extracting the first keyword from the text material; or
performing face recognition on an image corresponding to the speaker to obtain a first keyword corresponding to the speaker.
3. The method of claim 2, wherein the textual material is obtained by optical character recognition of an image corresponding to the presentation.
4. The method of any of claims 1 to 3, wherein determining a second keyword related to the first keyword comprises:
determining a second keyword related to the first keyword according to the position of the first keyword in the knowledge graph.
5. The method of claim 4, wherein determining the second keyword related to the first keyword comprises:
if the first keyword corresponds to a first node in the knowledge graph and the first node is an initial node, obtaining a second keyword according to all nodes of the knowledge graph; or
if the first keyword corresponds to a second node in the knowledge graph and the second node is a non-initial node, obtaining a second keyword according to the second node and nodes subordinate to the second node.
6. The method of claim 2, wherein extracting the first keyword from the textual material comprises:
performing named entity recognition on the text material to obtain entity words in the text material.
7. The method according to any one of claims 1 to 3, further comprising:
determining a third keyword related to the first keyword according to a search engine;
in this case, the adjusting the score of the decoding path corresponding to the voice data to be recognized includes:
adjusting the score of the decoding path corresponding to the voice data to be recognized according to the first keyword, the second keyword and the third keyword.
8. The method according to any one of claims 1 to 3, further comprising:
displaying the voice recognition result, and receiving correction information of the voice recognition result;
and adjusting the score of the decoding path corresponding to the subsequent voice data to be recognized according to the correction information.
9. A speech recognition apparatus, comprising:
the receiving module is used for receiving voice data to be recognized;
the first keyword determining module is used for determining a first keyword related to the voice data to be recognized;
the second keyword determining module is used for determining a second keyword related to the first keyword according to the knowledge graph;
the decoding processing module is used for decoding the voice data to be recognized, adjusting the score of a decoding path corresponding to the voice data to be recognized according to the first keyword and the second keyword, determining a voice recognition result corresponding to the voice data to be recognized according to the adjusted score of the decoding path, and displaying the voice recognition result; the adjustment is used for improving the score of a decoding path where the first keyword or the second keyword is located; determining an increased score corresponding to a decoding path where the first keyword or the second keyword is located according to the occurrence frequency or the occurrence ratio of the first keyword or the second keyword in the text material; or determining an increased score corresponding to the decoding path of the second keyword according to the correlation between the first keyword and the second keyword.
10. The apparatus of claim 9, wherein the first keyword determination module comprises:
the text extraction module is used for acquiring text materials related to the voice data to be recognized and extracting first keywords from the text materials; or
the image recognition module is used for performing face recognition on the image corresponding to the speaker so as to obtain a first keyword corresponding to the speaker.
11. The apparatus of claim 10, wherein the text material is obtained by optical character recognition of an image corresponding to the presentation.
12. The apparatus of any of claims 9 to 11, wherein the second keyword determination module comprises:
the graph processing module is used for determining a second keyword related to the first keyword according to the position of the first keyword in the knowledge graph.
13. The apparatus of claim 12, wherein the graph processing module comprises:
the first graph processing module, which is used for obtaining a second keyword according to all nodes of the knowledge graph if the first keyword corresponds to a first node in the knowledge graph and the first node is an initial node; or
the second graph processing module, which is used for obtaining a second keyword according to the second node and nodes subordinate to the second node if the first keyword corresponds to a second node in the knowledge graph and the second node is a non-initial node.
14. The apparatus of claim 10, wherein the text extraction module comprises:
the entity recognition module is used for performing named entity recognition on the text material so as to obtain entity words in the text material.
15. The apparatus of any of claims 9 to 11, further comprising:
the third keyword determining module is used for determining a third keyword related to the first keyword according to a search engine;
the decoding processing module comprises:
the first adjusting module is used for adjusting the score of the decoding path corresponding to the voice data to be recognized according to the first keyword, the second keyword and the third keyword.
16. The apparatus of any of claims 9 to 11, further comprising:
the display module is used for displaying the voice recognition result and receiving correction information of the voice recognition result;
and the second adjusting module is used for adjusting the score of the decoding path corresponding to the subsequent voice data to be recognized according to the correction information.
17. An apparatus for speech recognition, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
receiving voice data to be recognized, and determining a first keyword related to the voice data to be recognized;
determining a second keyword related to the first keyword according to a knowledge graph;
decoding the voice data to be recognized, adjusting the score of a decoding path corresponding to the voice data to be recognized according to the first keyword and the second keyword, determining a voice recognition result corresponding to the voice data to be recognized according to the adjusted score of the decoding path, and displaying the voice recognition result; the adjustment is used for improving the score of the decoding path where the first keyword or the second keyword is located; determining an increased score corresponding to a decoding path where the first keyword or the second keyword is located according to the occurrence frequency or the occurrence ratio of the first keyword or the second keyword in the text material; or determining an increased score corresponding to the decoding path of the second keyword according to the correlation between the first keyword and the second keyword.
18. The apparatus of claim 17, wherein the determining a first keyword related to the speech data to be recognized comprises:
acquiring text materials related to the voice data to be recognized, and extracting first keywords from the text materials; or
performing face recognition on the image corresponding to the speaker to obtain a first keyword corresponding to the speaker.
19. The apparatus of claim 18, wherein the textual material is derived from optical character recognition of an image corresponding to the presentation.
20. The apparatus according to any one of claims 17 to 19, wherein said determining a second keyword related to the first keyword comprises:
determining a second keyword related to the first keyword according to the position of the first keyword in the knowledge graph.
21. The apparatus of claim 20, wherein determining the second keyword related to the first keyword comprises:
if the first keyword corresponds to a first node in the knowledge graph and the first node is an initial node, obtaining a second keyword according to all nodes of the knowledge graph; or
if the first keyword corresponds to a second node in the knowledge graph and the second node is a non-initial node, obtaining a second keyword according to the second node and nodes subordinate to the second node.
22. The apparatus of claim 18, wherein the extracting the first keyword from the text material comprises:
performing named entity recognition on the text material to obtain entity words in the text material.
23. The apparatus according to any one of claims 17 to 19, wherein the apparatus is further configured such that the one or more processors execute the one or more programs, the one or more programs including instructions for:
determining a third keyword related to the first keyword according to a search engine;
in this case, the adjusting the score of the decoding path corresponding to the voice data to be recognized includes:
adjusting the score of the decoding path corresponding to the voice data to be recognized according to the first keyword, the second keyword and the third keyword.
24. The apparatus according to any one of claims 17 to 19, wherein the apparatus is further configured such that the one or more processors execute the one or more programs, the one or more programs including instructions for:
displaying the voice recognition result, and receiving correction information of the voice recognition result;
and adjusting the score of the decoding path corresponding to the subsequent voice data to be recognized according to the correction information.
25. One or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method of one or more of claims 1-8.
CN201911319959.0A 2019-12-19 2019-12-19 Speech recognition method, apparatus and medium Active CN111128183B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911319959.0A CN111128183B (en) 2019-12-19 2019-12-19 Speech recognition method, apparatus and medium
PCT/CN2020/112594 WO2021120690A1 (en) 2019-12-19 2020-08-31 Speech recognition method and apparatus, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911319959.0A CN111128183B (en) 2019-12-19 2019-12-19 Speech recognition method, apparatus and medium

Publications (2)

Publication Number Publication Date
CN111128183A CN111128183A (en) 2020-05-08
CN111128183B true CN111128183B (en) 2023-03-17

Family

ID=70500339

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911319959.0A Active CN111128183B (en) 2019-12-19 2019-12-19 Speech recognition method, apparatus and medium

Country Status (2)

Country Link
CN (1) CN111128183B (en)
WO (1) WO2021120690A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111128183B (en) * 2019-12-19 2023-03-17 北京搜狗科技发展有限公司 Speech recognition method, apparatus and medium
CN112802461B (en) * 2020-12-30 2023-10-24 深圳追一科技有限公司 Speech recognition method and device, server and computer readable storage medium
CN112802476B (en) * 2020-12-30 2023-10-24 深圳追一科技有限公司 Speech recognition method and device, server and computer readable storage medium
CN112765335B (en) * 2021-01-27 2024-03-08 上海三菱电梯有限公司 Voice call system
CN114464182B (en) * 2022-03-03 2022-10-21 慧言科技(天津)有限公司 Voice recognition fast self-adaption method assisted by audio scene classification

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9245525B2 (en) * 2011-01-05 2016-01-26 Interactions Llc Automated speech recognition proxy system for natural language understanding
CN103761261B * 2013-12-31 2017-07-28 北京紫冬锐意语音科技有限公司 Media search method and device based on speech recognition
EP3089159B1 (en) * 2015-04-28 2019-08-28 Google LLC Correcting voice recognition using selective re-speak
CN106205608A * 2015-05-29 2016-12-07 微软技术许可有限责任公司 Language modeling for speech recognition using a knowledge graph
US20170116180A1 (en) * 2015-10-23 2017-04-27 J. Edward Varallo Document analysis system
CN106328147B (en) * 2016-08-31 2022-02-01 中国科学技术大学 Speech recognition method and device
CN109145281B (en) * 2017-06-15 2020-12-25 北京嘀嘀无限科技发展有限公司 Speech recognition method, apparatus and storage medium
CN108831439B (en) * 2018-06-27 2023-04-18 广州视源电子科技股份有限公司 Voice recognition method, device, equipment and system
CN110176237A * 2019-07-09 2019-08-27 北京金山数字娱乐科技有限公司 Speech recognition method and device
CN110473523A * 2019-08-30 2019-11-19 北京大米科技有限公司 Speech recognition method, device, storage medium and terminal
CN111128183B (en) * 2019-12-19 2023-03-17 北京搜狗科技发展有限公司 Speech recognition method, apparatus and medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3158559A1 (en) * 2014-06-18 2017-04-26 Microsoft Technology Licensing, LLC Session context modeling for conversational understanding systems
CN107463301A (en) * 2017-06-28 2017-12-12 北京百度网讯科技有限公司 Conversational system construction method, device, equipment and computer-readable recording medium based on artificial intelligence
CN110070859A * 2018-01-23 2019-07-30 阿里巴巴集团控股有限公司 Speech recognition method and device
CN109300472A * 2018-12-21 2019-02-01 深圳创维-Rgb电子有限公司 Speech recognition method, device, equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yuzong Liu, et al. Graph-Based Semisupervised Learning for Acoustic Modeling in Automatic Speech Recognition. 2016, Vol. 24, No. 11, full text. *
Li Cuixia. Application and Progress of Modern Computer Intelligent Recognition Technology in Natural Language Processing Research. 2012, Vol. 12, No. 36, full text. *

Also Published As

Publication number Publication date
CN111128183A (en) 2020-05-08
WO2021120690A1 (en) 2021-06-24

Similar Documents

Publication Publication Date Title
CN111128183B (en) Speech recognition method, apparatus and medium
US11315546B2 (en) Computerized system and method for formatted transcription of multimedia content
CN109522419B (en) Session information completion method and device
CN110210310B (en) Video processing method and device for video processing
WO2021128880A1 (en) Speech recognition method, device, and device for speech recognition
CN110781305A (en) Text classification method and device based on classification model and model training method
CN107291704B (en) Processing method and device for processing
CN107274903B (en) Text processing method and device for text processing
CN110992942B (en) Voice recognition method and device for voice recognition
CN109961791B (en) Voice information processing method and device and electronic equipment
CN108628819B (en) Processing method and device for processing
WO2019109663A1 (en) Cross-language search method and apparatus, and apparatus for cross-language search
CN107424612B (en) Processing method, apparatus and machine-readable medium
CN107564526B (en) Processing method, apparatus and machine-readable medium
CN109977390B (en) Method and device for generating text
CN111739535A (en) Voice recognition method and device and electronic equipment
CN109979435B (en) Data processing method and device for data processing
CN112988956A (en) Method and device for automatically generating conversation and method and device for detecting information recommendation effect
CN111831132A (en) Information recommendation method and device and electronic equipment
CN108073294B (en) Intelligent word forming method and device for intelligent word forming
CN110929122B (en) Data processing method and device for data processing
CN112837668B (en) Voice processing method and device for processing voice
CN110781689B (en) Information processing method, device and storage medium
CN113409766A (en) Recognition method, device for recognition and voice synthesis method
CN111178086A (en) Data processing method, apparatus and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant