CN111276149A - Voice recognition method, device, equipment and readable storage medium - Google Patents

Voice recognition method, device, equipment and readable storage medium Download PDF

Info

Publication number
CN111276149A
CN111276149A (application CN202010058833.9A; granted publication CN111276149B)
Authority
CN
China
Prior art keywords
material data
content information
features
information
reference text
Prior art date
Legal status
Granted
Application number
CN202010058833.9A
Other languages
Chinese (zh)
Other versions
CN111276149B (en)
Inventor
吴嘉嘉 (Wu Jiajia)
殷兵 (Yin Bing)
胡金水 (Hu Jinshui)
刘聪 (Liu Cong)
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202010058833.9A
Publication of CN111276149A
Application granted
Publication of CN111276149B
Status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/22 - Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/24 - Speech recognition using non-acoustical features
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 - Computing systems specially adapted for manufacturing

Abstract

The application discloses a voice recognition method, device, equipment and readable storage medium. The method comprises: obtaining material data related to the speech to be recognized; determining content information contained in the material data, and determining a preliminary reference text corresponding to the material data based at least on semantic features of the content information and visual expression features of the content information in the material data; determining a reference text set corresponding to the speech to be recognized based on the preliminary reference text; and performing speech recognition on the speech to be recognized in combination with the reference text set to obtain a speech recognition result. Because the reference text set that assists recognition is determined from material data related to the speech to be recognized, text information such as professional terms can be obtained in advance as prior information to assist recognition, which can greatly improve the accuracy of the speech recognition result.

Description

Voice recognition method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, device, and readable storage medium.
Background
Speech recognition is the process of converting speech into text. In recent years, as speech recognition technology has matured, it has been successfully applied in many industries, particularly in specific domains. For example, performing speech recognition on conference recordings converts the received speech data directly into the text content of the conference, which greatly facilitates the conference secretary in preparing the conference summary.
However, speech recognition often encounters uncommon words, such as professional terms, for which its accuracy is low.
Disclosure of Invention
In view of the above, the present application provides a speech recognition method, apparatus, device, and readable storage medium. The specific scheme is as follows:
a speech recognition method comprising:
acquiring material data related to the voice to be recognized;
determining content information contained in the material data, and determining a preliminary reference text corresponding to the material data at least based on semantic features of the content information and visual expression features of the content information in the material data;
determining a reference text set corresponding to the voice to be recognized based on the preliminary reference text;
and carrying out voice recognition on the voice to be recognized by combining the reference text set to obtain a voice recognition result.
Preferably, the determining a preliminary reference text corresponding to the material data based on at least the semantic features of the content information and the visual representation features of the content information in the material data includes:
determining a preliminary reference text corresponding to the material data based on the semantic features of the content information and the visual expression features of the content information in the material data;
or, alternatively,
and determining a preliminary reference text corresponding to the material data based on the semantic features of the content information, the visual representation features of the content information in the material data and the attribute features of the material data.
Preferably,
the visual representation characteristics of the content information in the material data include any one or combination of more of the following:
the format, layout and position of the content information in the material data;
the attribute characteristics of the material data include any one or a combination of:
type of material data, style of material data, and authorship information of material data.
Preferably, the determining a preliminary reference text corresponding to the material data based on the semantic features of the content information and the visual representation features of the content information in the material data includes:
determining semantic features of the content information and visual representation features of the content information in the material data;
inputting the semantic features and the visual expression features into a configured key information determination model to obtain key information output by the model, the key information being used as the preliminary reference text corresponding to the material data, wherein:
the key information determination model is obtained by taking semantic features of content information contained in training data and visual expression features of the content information in the training data as training samples and taking labeled key information corresponding to the training data as sample labels for training.
Preferably, the process of determining semantic features of the content information comprises:
if the content information is text information, performing word segmentation on the text information, and determining a word vector of each word segmentation as a semantic feature of the word segmentation;
and if the content information is multimedia information, searching relevant text information by taking the multimedia information as a searching condition, segmenting the relevant text information, and determining a word vector of each segmented word as the semantic feature of the segmented word.
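As an illustrative sketch of the semantic-feature step above (not part of the claims): the text information is segmented into words and each word is mapped to its word vector, which serves as that word's semantic feature. The tokenizer and the embedding table below are toy placeholders; a real system would use a trained segmenter (e.g. for Chinese text) and pretrained word vectors.

```python
def segment(text):
    """Placeholder word segmentation: whitespace split stands in for a real segmenter."""
    return text.split()

# Hypothetical pretrained word-vector table (embedding dimension 3 for brevity).
WORD_VECTORS = {
    "AlphaGo": [0.9, 0.1, 0.2],
    "resign":  [0.2, 0.8, 0.1],
}
UNK = [0.0, 0.0, 0.0]  # vector used for out-of-vocabulary words

def semantic_features(text):
    """Return (word, vector) pairs: each word's vector is its semantic feature."""
    return [(w, WORD_VECTORS.get(w, UNK)) for w in segment(text)]

feats = semantic_features("AlphaGo resign match")
```

A word absent from the table ("match" here) falls back to the out-of-vocabulary vector, mirroring how any real embedding lookup must handle unseen segments.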
Preferably, the process of determining the visual representation characteristics of the content information in the material data comprises:
if the content information is text information, performing word segmentation on the text information, determining an image area where each word is located in the material data, and determining the visual expression characteristic of each word based on the image area;
and if the content information is multimedia information, determining an image area of the multimedia information in the material data, and determining the visual expression characteristics of the multimedia information based on the image area.
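The visual-expression step above can be sketched in the same illustrative spirit: given the image region (bounding box) where a word or multimedia item appears in the material data, derive simple visual features. The specific features chosen here (normalized position, and relative height as a font-size proxy) are assumptions for illustration only, not mandated by the application.

```python
def visual_features(box, page_w, page_h):
    """Derive visual-expression features from an image region.

    box = (x, y, w, h): the region's pixel bounding box in the material data.
    """
    x, y, w, h = box
    return {
        "cx": (x + w / 2) / page_w,   # normalized horizontal center
        "cy": (y + h / 2) / page_h,   # normalized vertical center
        "rel_height": h / page_h,     # larger for headings/emphasized text
    }

# A word box near the top of a 1000x800 slide, fairly tall (likely a title).
f = visual_features((100, 40, 200, 60), page_w=1000, page_h=800)
```

Features like these let a downstream classifier notice that bolded, enlarged, or prominently placed content is visually distinct from body text.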
Preferably, the key information determination model comprises a feature splicing layer and a classification discrimination layer;
the step of inputting the semantic features and the visual expression features into a configured key information determination model to obtain key information output by the model comprises the following steps:
splicing the semantic features and the visual expression features by using the feature splicing layer to obtain splicing features;
and judging whether the corresponding content information is key information or not based on the splicing characteristics by utilizing the classification judging layer, and outputting the judged key information.
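A minimal sketch of the two layers named in this claim, under the assumption of a logistic classifier with illustrative weights: the feature splicing layer concatenates the semantic and visual feature vectors, and the classification discrimination layer scores the spliced vector. A real key information determination model would learn these weights from the labeled training samples described above.

```python
import math

def splice(semantic_vec, visual_vec):
    """Feature splicing layer: simple concatenation of the two feature vectors."""
    return semantic_vec + visual_vec

def is_key_information(spliced, weights, bias=0.0, threshold=0.5):
    """Classification discrimination layer: logistic score over the spliced features."""
    z = sum(w * x for w, x in zip(weights, spliced)) + bias
    return 1.0 / (1.0 + math.exp(-z)) > threshold

# Illustrative 2-dim semantic + 2-dim visual features and hand-picked weights.
vec = splice([0.9, 0.1], [0.2, 0.075])
decision = is_key_information(vec, weights=[2.0, -1.0, 1.5, 4.0])
```

Content judged key by the discrimination layer is what the model outputs as the preliminary reference text.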
Preferably, the determining a preliminary reference text corresponding to the material data based on the semantic features of the content information, the visual representation features of the content information in the material data, and the attribute features of the material data includes:
determining semantic features of the content information, visual representation features of the content information in the material data and attribute features of the material data;
inputting the semantic features, the visual expression features and the attribute features into a configured key information determination model to obtain key information output by the model, the key information being used as the preliminary reference text corresponding to the material data, wherein:
the key information determination model is obtained by taking semantic features of content information contained in training data, visual expression features of the content information in the training data and attribute features of the training data as training samples and taking labeled key information corresponding to the training data as sample labels for training.
Preferably, the key information determination model comprises a feature splicing layer and a classification discrimination layer;
the step of inputting the semantic features, the visual expression features and the attribute features into the configured key information determination model to obtain the key information output by the model comprises:
splicing the semantic features, the visual expression features and the attribute features by using the feature splicing layer to obtain splicing features;
and judging whether the corresponding content information is key information or not based on the splicing characteristics by utilizing the classification judging layer, and outputting the judged key information.
Preferably, the determining, based on the preliminary reference text, a reference text set corresponding to the speech to be recognized includes:
combining the preliminary reference texts into a reference text set corresponding to the speech to be recognized;
or, alternatively,
and performing knowledge graph expansion on the basis of the preliminary reference text to obtain an expanded reference text, and forming a reference text set corresponding to the voice to be recognized by using the preliminary reference text and the expanded reference text.
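The knowledge-graph expansion alternative can be sketched as follows: starting from the preliminary reference texts, directly related terms are pulled in from a graph, and the union forms the reference text set. The tiny adjacency map below stands in for a real knowledge graph and is purely illustrative.

```python
# Toy stand-in for a knowledge graph: term -> directly related terms.
TOY_GRAPH = {
    "AlphaGo": ["Go", "DeepMind"],
    "Go": ["komi"],
}

def expand(preliminary, graph):
    """Form the reference text set: preliminary texts plus their graph neighbors."""
    expanded = set(preliminary)
    for term in preliminary:
        expanded.update(graph.get(term, []))  # one-hop expansion only
    return expanded

ref_set = expand(["AlphaGo"], TOY_GRAPH)
```

This one-hop expansion is one plausible reading; deeper traversal or relevance filtering would be design choices beyond what the text specifies.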
Preferably, the performing speech recognition on the speech to be recognized by combining the reference text set to obtain a speech recognition result includes:
recognizing the voice to be recognized by using a voice recognition model, and carrying out forward excitation on the reference text in the reference text set in the recognition process to obtain a voice recognition result;
or, alternatively,
recognizing the voice to be recognized by using a voice recognition model to obtain a preliminary voice recognition result;
and correcting the preliminary voice recognition result by using the reference texts in the reference text set to obtain a corrected voice recognition result.
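The second alternative (recognize first, then correct the preliminary result against the reference text set) can be sketched with a simple string-similarity correction: a recognized word is replaced by a reference text when the two are sufficiently similar. The 0.6 similarity cutoff is an illustrative choice, not taken from the application.

```python
import difflib

def correct(preliminary_words, reference_set, threshold=0.6):
    """Replace each recognized word with the closest reference text, if close enough."""
    corrected = []
    for word in preliminary_words:
        match = difflib.get_close_matches(word, reference_set, n=1, cutoff=threshold)
        corrected.append(match[0] if match else word)
    return corrected

# A misrecognized word is snapped to the nearby reference text.
result = correct(["investmant"], {"investment", "resign"})
```

A production system would more likely operate on phonetic or lattice-level similarity than raw string distance, but the control flow is the same.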
Preferably, the material data comprises material data in a picture format and/or a non-picture format;
the determining content information contained in the material data includes:
and if the material data is in a picture format, performing OCR (optical character recognition) on the material data in the picture format to obtain content information contained in the material data.
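The format dispatch described in this claim can be sketched as follows: picture-format material is routed through OCR, while non-picture (electronic) material is read directly. `run_ocr` is a stub standing in for a real OCR engine such as Tesseract.

```python
def run_ocr(image_bytes):
    """Stub OCR engine; a real system would invoke e.g. Tesseract here."""
    return "text recognized from picture"

def content_information(material, fmt):
    """Dispatch on material format to obtain the contained content information."""
    if fmt == "picture":
        return run_ocr(material)
    return material  # electronic document: text is available directly

info = content_information(b"\x89PNG...", fmt="picture")
```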
A speech recognition apparatus comprising:
a material data acquisition unit for acquiring material data related to a voice to be recognized;
a content information determination unit configured to determine content information included in the material data;
a preliminary reference text determining unit, configured to determine a preliminary reference text corresponding to the material data based on at least a semantic feature of the content information and a visual expression feature of the content information in the material data;
a reference text set determining unit, configured to determine, based on the preliminary reference text, a reference text set corresponding to the speech to be recognized;
and the voice recognition unit is used for carrying out voice recognition on the voice to be recognized by combining the reference text set to obtain a voice recognition result.
Preferably, the preliminary reference text determining unit includes:
the first preliminary reference text determining subunit is used for determining a preliminary reference text corresponding to the material data based on the semantic features of the content information and the visual expression features of the content information in the material data;
or, alternatively,
and the second preliminary reference text determining subunit is used for determining a preliminary reference text corresponding to the material data based on the semantic features of the content information, the visual expression features of the content information in the material data and the attribute features of the material data.
Preferably, the first preliminary reference text determination subunit includes:
a first feature determination unit, configured to determine semantic features of the content information and visual expression features of the content information in the material data;
the first model determining unit is used for inputting the semantic features and the visual expression features into a configured key information determining model to obtain key information output by the model, and the key information is used as a preliminary reference text corresponding to the material data, wherein:
the key information determination model is obtained by taking semantic features of content information contained in training data and visual expression features of the content information in the training data as training samples and taking labeled key information corresponding to the training data as sample labels for training.
Preferably, the process of determining the semantic features of the content information by the first feature determination unit includes:
if the content information is text information, performing word segmentation on the text information, and determining a word vector of each word segmentation as a semantic feature of the word segmentation;
and if the content information is multimedia information, searching relevant text information by taking the multimedia information as a searching condition, segmenting the relevant text information, and determining a word vector of each segmented word as the semantic feature of the segmented word.
Preferably, the process of determining the visual performance characteristics of the content information in the material data by the first characteristic determination unit includes:
if the content information is text information, performing word segmentation on the text information, determining an image area where each word is located in the material data, and determining the visual expression characteristic of each word based on the image area;
and if the content information is multimedia information, determining an image area of the multimedia information in the material data, and determining the visual expression characteristics of the multimedia information based on the image area.
Preferably, the key information determination model may include a feature concatenation layer and a classification discrimination layer, and based on this, the first model determination unit inputs the semantic features and the visual expression features into the configured key information determination model to obtain the key information output by the model, including:
splicing the semantic features and the visual expression features by using the feature splicing layer to obtain splicing features;
and judging whether the corresponding content information is key information or not based on the splicing characteristics by utilizing the classification judging layer, and outputting the judged key information.
Preferably, the second preliminary reference text determination subunit includes:
a second feature determination unit, configured to determine semantic features of the content information, visual expression features of the content information in the material data, and attribute features of the material data;
a second model determining unit, configured to input the semantic features, the visual expression features and the attribute features into a configured key information determination model to obtain key information output by the model, which is used as the preliminary reference text corresponding to the material data, wherein:
the key information determination model is obtained by taking semantic features of content information contained in training data, visual expression features of the content information in the training data and attribute features of the training data as training samples and taking labeled key information corresponding to the training data as sample labels for training.
Preferably, the key information determination model may include a feature concatenation layer and a classification discrimination layer, and based on this, the second model determination unit inputs the semantic features, the visual expression features, and the attribute features into the configured key information determination model to obtain the key information output by the model, including:
splicing the semantic features, the visual expression features and the attribute features by using the feature splicing layer to obtain splicing features;
and judging whether the corresponding content information is key information or not based on the splicing characteristics by utilizing the classification judging layer, and outputting the judged key information.
Preferably, the reference text set determining unit includes:
a first reference text set determining subunit, configured to combine the preliminary reference texts into a reference text set corresponding to the speech to be recognized;
or, alternatively,
and the second reference text set determining subunit is used for performing knowledge graph expansion on the basis of the preliminary reference text to obtain an expanded reference text, and the preliminary reference text and the expanded reference text form a reference text set corresponding to the voice to be recognized.
Preferably, the voice recognition unit includes:
the first voice recognition subunit is used for recognizing the voice to be recognized by using a voice recognition model and carrying out forward excitation on the reference texts in the reference text set in the recognition process to obtain a voice recognition result;
or, alternatively,
the second voice recognition subunit is used for recognizing the voice to be recognized by using a voice recognition model to obtain a preliminary voice recognition result; and correcting the preliminary voice recognition result by using the reference texts in the reference text set to obtain a corrected voice recognition result.
Preferably, the material data includes material data in a picture format and/or a non-picture format, and based on this, the process of the content information determination unit determining the content information included in the material data includes:
and if the material data is in a picture format, performing OCR (optical character recognition) on the material data in the picture format to obtain content information contained in the material data.
A speech recognition device comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the voice recognition method.
A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the speech recognition method as described above.
Through the above technical scheme, the speech recognition method can acquire material data related to the speech to be recognized; for example, when the speech to be recognized is a lecture, the material data may be a lecture manuscript. The method can further determine the content information contained in the material data, which includes information related to the speech to be recognized. Based on the semantic features of the content information and the visual expression features of the content information in the material data, the method can determine a preliminary reference text capable of assisting speech recognition, for example professional vocabulary and expressions from the lecture manuscript. It can then determine a reference text set corresponding to the speech to be recognized based on the preliminary reference text, and perform speech recognition on the speech to be recognized in combination with the reference text set. Because the reference text set that assists recognition is determined from material data related to the speech to be recognized, text information such as professional terms can be acquired in advance as prior information to assist recognition, which can greatly improve the accuracy of the speech recognition result.
Furthermore, considering that reference texts capable of assisting speech recognition are generally relatively key information, and that the visual representation of key information in the material data generally differs from that of other information, the method considers both the semantic features of the content information and its visual expression features in the material data when determining the preliminary reference text. The preliminary reference text, and the reference text set determined from it, therefore contain more key information, can better assist the speech recognition, and further improve the accuracy of the speech recognition result.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic flow chart of a speech recognition method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a PPT lecture manuscript according to an example of the present application;
FIG. 3 illustrates a diagram of a key information determination model determining key information corresponding to content information;
FIG. 4 illustrates another exemplary key information determination model for determining key information corresponding to content information;
fig. 5 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a speech recognition device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The present application provides a speech recognition scheme that can be applied to a speech recognition device, that is, a device capable of recognizing and processing speech data. In general, the speech recognition device may be a common electronic device with data processing capability, such as a mobile phone, a computer, an iPad, a server, or a cloud platform.
One applicable scenario is an academic conference or lecture, where speech recognition needs to be performed on the speaker's speech so that it can be organized into a conference summary in text form, or so that translation can be performed based on the recognition result; both depend on a highly accurate speech recognition result. However, in some cases the speaker may use specialized terms, which a general speech recognition model recognizes inaccurately. For example, when a speaker is introducing the Go match between AlphaGo and Lee Sedol, the Go term for resigning (literally "throwing in the stones") may be spoken, and a general model may recognize it as the similar-sounding word for "investment", resulting in a speech recognition error.
Therefore, the application provides a voice recognition method, which aims to solve the problems and improve the accuracy of voice recognition.
As described in conjunction with fig. 1, the speech recognition method of the present application may include the following steps:
and S100, acquiring material data related to the voice to be recognized.
Specifically, the material data is related to the speech to be recognized and may include lecture manuscripts related to that speech, such as PPT or Word documents, as well as speaker information, the conference topic, and key conference information added by the host or users according to the conference requirements.
The material data may be acquired in advance before speech recognition is performed on the speech to be recognized. For example, a lecture manuscript is provided in advance by a speaker before a meeting.
In addition, the material data may also be acquired during the speech recognition of the speech to be recognized. For example, in some scenarios the lecture manuscript cannot be acquired in advance and is only displayed on a screen when the lecture begins. In that case, once the lecture manuscript becomes visible, pictures of it can be captured by a camera and used as the material data related to the speech to be recognized.
It is understood that the material data related to the speech to be recognized may be in a picture format, such as being photographed from a paper document or a display screen on which the material data is played, or the material data itself may be electronic data in a picture format. In addition, the material data related to the voice to be recognized can also be in a non-picture format, such as a word document in an electronic form.
The format of the material data is not strictly limited in this application.
Step S110, content information contained in the material data is determined, and a preliminary reference text corresponding to the material data is determined at least based on semantic features of the content information and visual expression features of the content information in the material data.
Specifically, the content information included in the material data may be of various types, such as text content information, multimedia content information, and the like, where the multimedia content information includes image content, audio-video content, and the like. In an exemplary scenario, for example, in a speech process, speech recognition needs to be performed on the speech of a speaker, a speech manuscript PPT of the speaker can be acquired as material data, the PPT may include text information, image information, audio and video information, and the information can be determined from the material data.
It is understood that, depending on the format of the material data, the manner of determining the content information contained in it may also differ. For example, when the material data is in a picture format, OCR (Optical Character Recognition) may be performed on it to obtain the contained content information. When the material data is in a non-picture format, the content information can be read directly from the electronic document; for example, for a Word document in electronic form, the recorded content information can be acquired directly.
Furthermore, considering that reference text capable of assisting speech recognition is generally key information, the importance of content information can be analyzed through its semantic features, that is, whether the content information is key information. In addition, the visual representation of key information in the material data generally differs from that of other information: for example, a speaker may bold important information in the lecture manuscript, color its font, or place it in a prominent position, making it visually distinct. Therefore, the process of determining the preliminary reference text considers both the semantic features of the content information and its visual expression features in the material data, so the resulting preliminary reference text contains more key information and can better assist speech recognition.
And step S120, determining a reference text set corresponding to the voice to be recognized based on the preliminary reference text.
Specifically, after the preliminary reference text corresponding to the material data is determined in the previous step, the reference text set corresponding to the speech to be recognized can be further determined based on the preliminary reference text.
Wherein the reference text in the reference text set may comprise preliminary reference text.
And the reference text in the reference text set corresponding to the voice to be recognized is used for assisting the voice recognition of the voice to be recognized.
And step S130, carrying out voice recognition on the voice to be recognized by combining the reference text set to obtain a voice recognition result.
It can be understood that the reference text in the reference text set is determined based on the material data related to the speech to be recognized, and is used for assisting the speech recognition of the speech to be recognized, so that the speech to be recognized can be subjected to the speech recognition by combining the reference text set to obtain the speech recognition result.
The speech recognition method can acquire material data related to the speech to be recognized; for example, when the speech to be recognized is a lecture, the material data may be the lecture manuscript. The method then determines the content information contained in the material data, which carries information related to the speech to be recognized. Based on the semantic features of the content information and its visual expression features in the material data, the method determines a preliminary reference text capable of assisting speech recognition, for example, professional vocabulary and expressions identified in the lecture manuscript. It further determines a reference text set corresponding to the speech to be recognized based on the preliminary reference text, and performs speech recognition on the speech to be recognized in combination with the reference text set. Because the reference text set capable of assisting speech recognition is determined by means of material data related to the speech to be recognized, text information such as professional terms can be acquired in advance as prior information to assist recognition, which significantly improves the accuracy of the speech recognition result.
Furthermore, considering that reference texts capable of assisting speech recognition are generally key information, and that the visual representation of key information in material data generally differs from that of other information, the method considers both the semantic features of the content information and its visual expression features in the material data when determining the preliminary reference text. The preliminary reference text and the reference text set determined on this basis therefore contain more key information, can better assist speech recognition, and further improve the accuracy of the speech recognition result.
Optionally, regarding the semantic features of the content information in the above embodiment: the semantic features represent the content information at the semantic level and carry its semantic-level characteristics. The semantic features may be feature vectors capable of representing the semantics of the content information, such as the hidden features of an existing neural network model that processes the content information.
Further optionally, regarding the visual representation features of the content information in the material data in the above embodiments: the visual representation features represent the content information at the visual level and carry its visual-level characteristics. The visual representation features may include any one or a combination of the following:
the format, layout, position, etc. of the content information in the material data.
The format may include font size, font type, bolding, underlining, spacing, and the like. The layout indicates the typesetting of the content information in the material data. The position indicates the position information of the content information in the material data.
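As an illustration only (not a data structure defined by the patent), the visual representation features listed above can be collected into a simple record and flattened into a numeric vector for a downstream model; all field names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class VisualFeatures:
    font_size: float   # format: font size in points
    bold: bool         # format: bolded or not
    underlined: bool   # format: underlined or not
    x: float           # position: normalized horizontal position on the page
    y: float           # position: normalized vertical position on the page

def to_vector(v: VisualFeatures) -> list:
    """Flatten the visual representation features into a numeric vector."""
    return [v.font_size, float(v.bold), float(v.underlined), v.x, v.y]
```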
In an alternative mode, if the content information includes both text information and multimedia information, the part of the text information located around the multimedia information (for example, within a set distance of it) generally has a higher probability of being key information, and is more likely to be preliminary reference text, than the part of the text information located far from the multimedia information (for example, beyond the set distance). It can be understood that, according to the typesetting of the content information, text at different positions in the material data has different visual representation characteristics; the visual representation characteristics of the text around the multimedia information better reflect its criticality, and that text can subsequently be determined to belong to the preliminary reference text based on the semantic features and visual representation features.
Referring to fig. 2, an example lecture manuscript PPT is illustrated. When the PPT illustrated in fig. 2 is taken as material data, the content information in the PPT, including text and images, can be acquired.
For the text therein, it can be seen that "patent meaning and its role" is bolded and italicized, its font size is larger than that of the other text, and it is positioned at the top of the PPT page, so it is important information.

Further, the text "beneficial to the long-term development of enterprises" has shading applied and is positioned around the image, so it can also be seen to be important information.

Still further, "company A" and "company B" in the PPT are displayed in bold, so they are also important information.
In another embodiment of the present application, a process of determining a preliminary reference text corresponding to the material data based on at least the semantic features of the content information and the visual representation features thereof in the material data in step S110 is described.
As described above, the semantic features and the visual representation features of the content information strongly influence whether the content information is determined to be key information, and thereby help improve speech recognition accuracy. The present application therefore provides two alternative embodiments for step S110, as follows:
The first manner:
The method and the device can determine the preliminary reference text corresponding to the material data based on the semantic features of the content information and the visual expression features of the content information in the material data.
That is, the preliminary reference text corresponding to the material data may be determined with reference to only the semantic features and the visual expression features of the content information.
The second manner:
The applicant has found through research that the attribute features of the material data also strongly influence the determination of the preliminary reference text corresponding to the material data. Taking PPT lecture manuscripts as an example of material data, they may come in a number of different styles, and the display positions, display modes, and the like of key information differ between PPT lecture manuscripts of different styles.
Therefore, the method and the device can further refer to the attribute characteristics of the material data on the basis of the semantic characteristics and the visual expression characteristics of the reference content information to determine the preliminary reference text corresponding to the material data.
The attribute characteristics of the material data may include any one or a combination of the following:
the type of the material data, the style of the material data, and the author attribute information of the material data.
The type of the material data may include a picture type, a Word document type, an Excel document type, a PPT lecture manuscript type, and the like. The style of the material data can be further subdivided within each type; for PPT lecture manuscripts, for example, business style, leisure style, and so on. The author attribute information mainly includes attributes of the author who wrote the material data, such as the author's occupation, gender, background, and writing preference; these attributes affect the visual presentation of key information in the material data. For example, an author working in business may refer more often to business terms such as "B2B" in the material data, while an author engaged in artificial intelligence research may refer more often to terms in that field, such as "machine learning" and "artificial intelligence".
In another embodiment of the present application, the first way of determining the preliminary reference text described above is first explained.
A first procedure for determining a preliminary reference text may comprise the steps of:
S1, determining the semantic features of the content information and the visual representation features of the content information in the material data.
Specifically, for the determination process of the semantic features of the content information, a corresponding determination mode may be configured according to the type of the content information.
The content information may include a text information type and a multimedia information type. The corresponding semantic feature determination methods are as follows:
1. If the content information is text information, word segmentation is performed on the text information, and the word vector of each segmented word is determined as the semantic feature of that word.

For the determined content information, if text information is found in it, word segmentation processing can first be performed on the text information to obtain a number of segmented words. Further, the word vector of each segmented word can be determined as its semantic feature.
Still taking the material data illustrated in fig. 2 as an example, the content information that can be determined therefrom includes text information: "patent meaning and action".
The text information may be first segmented, and the obtained segmentation includes: "patent", "meaning", "and", "action". Further, a word vector of each segmented word can be determined as a semantic feature of the corresponding segmented word.
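A minimal sketch of this step, under two stated simplifications: whitespace segmentation stands in for a real word segmenter, and a deterministic hash-based embedding stands in for trained word vectors such as word2vec. Neither substitution is the patent's method.

```python
import hashlib

def segment(text):
    # Naive whitespace segmentation; a real system would use a trained
    # (e.g. Chinese) word segmenter.
    return text.split()

def word_vector(word, dim=4):
    # Toy deterministic embedding standing in for trained word vectors:
    # the first `dim` bytes of an MD5 digest, scaled to [0, 1].
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]

def semantic_features(text):
    """Map each segmented word to its word vector (its semantic feature)."""
    return {w: word_vector(w) for w in segment(text)}
```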
2. If the content information is multimedia information, relevant text information is retrieved using the multimedia information as the retrieval condition, the relevant text information is segmented, and the word vector of each segmented word is determined as the semantic feature of that word.
Specifically, if the content information is multimedia information, such as audio, video, and image, the multimedia information may be used as a search condition to search the database for the relevant text information. The database may be an internet database or an enterprise database.
Taking the material data illustrated in fig. 2 as an example, the content information that can be determined therefrom includes the caricature image in fig. 2. Relevant text information can be retrieved based on the cartoon image, such as an article with the topic of 'intellectual property international enterprise patent layout management' can be retrieved.
After the relevant text information is obtained through retrieval, word segmentation can be performed on the relevant text, and word segmentation word vectors are determined.
Furthermore, the visual representation characteristics of the content information in the material data may be configured in a corresponding manner according to the type of the content information.
The content information may include a text information type and a multimedia information type. The corresponding visual performance characteristics are determined as follows:
1. If the content information is text information, word segmentation is performed on the text information, the image area where each segmented word is located in the material data is determined, and the visual expression features of each word are determined based on that image area.
For the word segmentation processing operation of the text information, the above description can be referred to.
It is understood that, whether the material data is in a picture format or in an electronic format, the text information has a certain typesetting layout in the material data, and when the material data is regarded as a whole image, each word segmentation in the material data occupies a different image area. Therefore, in this step, the image area where each word is located in the material data can be determined, and the visual expression characteristics of the word are determined based on the image area corresponding to the word.
Specifically, the OCR recognition model may be used to determine visual performance characteristics of the segmented words, that is, the image area corresponding to the segmented words is input into the OCR recognition model, and the OCR recognition model extracts the hidden layer characteristics based on the image area and performs classification recognition based on the hidden layer characteristics to output the segmented words corresponding to the recognized image area. In this embodiment, the hidden layer features extracted by the OCR recognition model are obtained as the visual expression features of the corresponding segmented words.
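The idea of reusing an encoder's hidden-layer activations as visual features can be sketched with a single linear-plus-ReLU layer standing in for the OCR model's encoder. This is a deliberate simplification for illustration; a real OCR model would be a trained deep network, and the pixel values and weights here are hypothetical.

```python
def hidden_features(region_pixels, encoder_weights):
    """One linear layer + ReLU standing in for the OCR model's encoder.

    The returned activations play the role of the hidden-layer features
    that are taken as the visual expression features of the segmented word.
    """
    return [max(0.0, sum(p * w for p, w in zip(region_pixels, row)))
            for row in encoder_weights]
```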
2. If the content information is multimedia information, the image area where the multimedia information is located in the material data is determined, and the visual expression features of the multimedia information are determined based on that image area.
For multimedia information, the multimedia information can be regarded as a whole, an image area where the multimedia information is located in the material data is determined, and the visual performance characteristics of the multimedia information are determined based on the image area.
Similarly, the visual representation characteristics of the multimedia information may be determined by using an image recognition model, that is, an image region corresponding to the multimedia information is input into the image recognition model, and the model extracts hidden layer characteristics based on the image region and performs image classification and recognition based on the hidden layer characteristics to output an image classification and recognition result corresponding to the recognized image region. In this embodiment, the hidden layer feature extracted by the image recognition model is obtained as the visual representation feature of the multimedia information.
The above has introduced the process of determining semantic features of multimedia information, i.e. the word segmentation result of the related text information of the multimedia information and the semantic features of each word segmentation are obtained. Therefore, in this step, the visual expression characteristic of the multimedia information determined based on the image area may be used as the visual expression characteristic of each participle of the text information related to the multimedia information.
Therefore, no matter the content information is text information or multimedia information, the semantic features and the visual expression features of the multiple participles can be finally obtained.
S2, inputting the semantic features and the visual expression features into a configured key information determination model, and obtaining the key information output by the model as the preliminary reference text corresponding to the material data.
Specifically, the semantic features and the visual expression features of each obtained participle may be input into a key information determination model, and the model outputs whether the corresponding participle is key information. And finally, taking the key information output by the model as a preliminary reference text corresponding to the material data.
The key information determination model may be obtained by training, with the semantic features of the content information contained in training data and the visual expression features of that content information in the training data as training samples, and with the labeled key information corresponding to the training data as sample labels.
Next, the structure of the key information determination model will be described in this embodiment.
The key information determination model may include a feature concatenation layer and a classification discrimination layer. Based on the above, after the semantic features and the visual expression features of the content information are input into the key information determination model, the feature splicing layer of the model performs feature splicing on the input semantic features and the visual expression features, the classification and judgment layer further judges whether the corresponding content information is the key information based on the splicing features, and outputs the judged key information.
Referring to fig. 3, fig. 3 illustrates a schematic diagram of determining key information corresponding to content information by using a key information determination model.
Fig. 3 illustrates the process of determining whether the text content "company B" is key information. First, the semantic features of "company B" and its visual expression features are determined. These are then input into the key information determination model; the feature splicing layer of the model splices them into a spliced feature, which is passed to the classification discrimination layer. That layer judges, based on the spliced feature, whether the text content "company B" is key information, and outputs the determined key information.
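The model structure described above can be sketched numerically as follows, with a plain dot product plus sigmoid standing in for the classification discrimination layer. The weights, bias, and feature values are hypothetical illustrations, not trained values.

```python
import math

def determine_key_information(semantic, visual, weights, bias, threshold=0.5):
    """Feature splicing layer followed by a logistic classification layer."""
    spliced = semantic + visual                    # feature splicing layer: concatenation
    score = sum(x * w for x, w in zip(spliced, weights)) + bias
    probability = 1.0 / (1.0 + math.exp(-score))   # classification discrimination layer
    return probability >= threshold                # True means "key information"
```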
In yet another embodiment of the present application, the second manner of determining the preliminary reference text described above is further described.
The second procedure for determining the preliminary reference text may include the steps of:
s1, determining semantic features of the content information, visual representation features of the content information in the material data and attribute features of the material data.
The determination process of the semantic features and the visual representation features of the content information may refer to the related description above, and will not be described herein again.
In contrast to the first process of determining the preliminary reference text described above, the process of determining the attribute characteristics of the material data is further added to the present embodiment.
S2, inputting the semantic features, the visual expression features, and the attribute features into a configured key information determination model, and obtaining the key information output by the model as the preliminary reference text corresponding to the material data.
Different from the key information determination model mentioned in the first process of determining the preliminary reference text described above, in the training process, the key information determination model in this embodiment may be obtained by training, with the semantic features of the content information included in the training data, the visual expression features of the content information in the training data, and the attribute features of the training data as training samples, and with the labeled key information corresponding to the training data as sample labels.
That is, in the embodiment, the attribute features of the training data are added to the training samples of the key information determination model training process.
By adding the attribute features of the training data to the training samples, the trained key information determination model identifies and confirms key information more accurately.
Further, the structure of the key information determination model in the present embodiment will be described.
The key information determination model may include a feature concatenation layer and a classification discrimination layer. After the semantic features and visual expression features of the content information and the attribute features of the material data are input into the model, the feature concatenation layer splices the three groups of input features, and the classification discrimination layer judges, based on the spliced features, whether the corresponding content information is key information, and outputs the determined key information.
Referring to fig. 4, fig. 4 illustrates another exemplary diagram of determining key information corresponding to content information by using a key information determination model.
Fig. 4 illustrates the process of determining whether the text content "company B", in a PPT lecture manuscript used as material data, is key information. First, the semantic features of "company B" and its visual expression features are determined, along with the attribute features of the PPT lecture manuscript. These features are input into the key information determination model; the feature splicing layer of the model splices them into a spliced feature, which is passed to the classification discrimination layer. That layer judges, based on the spliced feature, whether the text content "company B" is key information, and outputs the determined key information.
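The structural difference from the first model can be sketched as follows: the material data's attribute features simply join the spliced vector. The one-hot vocabularies below are hypothetical, drawn from the example types and styles mentioned earlier.

```python
# Hypothetical attribute vocabularies based on the examples in the text.
MATERIAL_TYPES = ["picture", "word", "excel", "ppt"]
STYLES = ["business", "leisure"]

def attribute_features(material_type, style):
    """One-hot encode the material data's type and style attributes."""
    return ([1.0 if material_type == t else 0.0 for t in MATERIAL_TYPES]
            + [1.0 if style == s else 0.0 for s in STYLES])

def splice(semantic, visual, attributes):
    # The only structural change from the first model: the attribute
    # features of the material data join the spliced feature vector.
    return semantic + visual + attributes
```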
In another embodiment of the present application, a process of determining a reference text set corresponding to the speech to be recognized based on the preliminary reference text in step S120 is described.
It can be understood that the preliminary reference texts may be directly combined into a reference text set corresponding to the speech to be recognized, that is, the reference texts in the reference text set are the preliminary reference texts.
Further, the determined preliminary reference text may not be comprehensive in some scenarios. For example, the lecturer speaks about the Go match between AlphaGo and Lee Sedol, while the lecture manuscript records only the two keywords "AlphaGo" and "Lee Sedol", which are determined as the preliminary reference text according to this embodiment.

But during the presentation the presenter speaks the Go term "throw in" (to resign). Since this term is not covered by the preliminary reference text, there is a high possibility that speech recognition misrecognizes it as its near-homophone "investment".
Therefore, in this embodiment, the knowledge graph may be expanded based on the preliminary reference text to obtain an expanded reference text, where the expanded reference text may be a related text expanded from the preliminary reference text. Furthermore, a reference text set corresponding to the speech to be recognized can be composed of the preliminary reference text and the expanded reference text. That is, the reference texts in the reference text set are composed of the preliminary reference texts and the expanded reference texts.
The above example is still used to illustrate:
after obtaining the preliminary reference texts "alpha dog" and "lie stone", a correlation search may be performed based on the two words "alpha dog" and "lie stone". Since the terms "alpha dog" and "lie stone" are related to the game of go, and "throw" appears in the game of go, the embodiment can be expanded to obtain an expanded reference text of "throw".
In the following example, an alternative implementation of knowledge-graph expansion based on preliminary reference text is presented.
Specifically, in the embodiment of the present application, key information in each field scene, such as hotwords and professional term expressions, may be collected and acquired in advance to form a key information table. Further, for each key information in the key information table, a semantic vector capable of characterizing semantic features of the key information is determined.
After the preliminary reference text is obtained, a semantic vector representing its semantic features is first determined; the distance between this vector and the semantic vector of each piece of key information in the key information table is then calculated, and the key information whose distance is smaller than a distance threshold (determined experimentally) is selected as the expanded reference text.
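The distance-based expansion can be sketched as follows, using cosine distance as an assumed distance measure (the patent does not fix a particular metric) over a toy key information table with hypothetical two-dimensional vectors.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def expand_by_distance(preliminary_vector, key_info_table, threshold):
    """Select key information whose semantic vector lies within the distance threshold."""
    return [key for key, vec in key_info_table.items()
            if cosine_distance(preliminary_vector, vec) < threshold]
```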
Of course, other ways of expanding the knowledge graph of the preliminary reference text may also be used. For example, material texts containing the preliminary reference text may be retrieved, the number of co-occurrences of each word with the preliminary reference text in those texts counted, and the words whose co-occurrence count exceeds a threshold selected as expanded reference text. Other alternatives are also possible.
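The co-occurrence alternative can be sketched as follows, with two stated simplifications: whitespace tokens stand in for real word segmentation, and substring containment stands in for retrieval of material texts.

```python
from collections import Counter

def cooccurrence_expand(preliminary, material_texts, min_count=2):
    """Expand the preliminary reference text by co-occurrence counting."""
    counts = Counter()
    for text in material_texts:
        if preliminary in text:  # a material text containing the preliminary reference text
            counts.update(set(text.split()) - {preliminary})
    # Words co-occurring at least min_count times become expanded reference text.
    return sorted(w for w, c in counts.items() if c >= min_count)
```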
In another embodiment of the present application, a process of performing speech recognition on the speech to be recognized in the step S130 by combining with the reference text set to obtain a speech recognition result is described.
In an optional manner, since the reference texts in the reference text set are likely to appear in the recognition result of the speech to be recognized, the reference text set may be used as prior information. The speech to be recognized is recognized with a speech recognition model, and the reference texts are forward-excited during recognition: whenever a reference text appears as a candidate recognition result, its score is boosted, which raises the probability that the reference text becomes part of the final recognition result.
The speech recognition model used in the speech recognition process may be a general speech recognition model or a speech recognition model customized for a specific scene, which is not strictly limited in the present application.
One example scenario is as follows:
The reference text set contains: "Luoji Siwei" (the name of a media brand; in Chinese it is a near-homophone of "logical thinking"). The text corresponding to the speech to be recognized is: "Details of ticket and performance information for the 2020 Luoji Siwei New Year's Eve speech by Luo Zhenyu".

In the process of recognizing the speech to be recognized, when the speech segment corresponding to "Luoji Siwei" is reached, the speech recognition model produces two candidate recognition results: "logical thinking" and "Luoji Siwei".

According to the prior art, because "logical thinking" occurs more frequently than "Luoji Siwei", the former scores higher, and the final recognition result would be "logical thinking", which is obviously not the correct recognition result.

In this embodiment, "Luoji Siwei" is found during recognition to belong to the reference text set, so its score is forward-excited. The excited score can exceed that of "logical thinking", and the speech segment is therefore finally recognized correctly as "Luoji Siwei".
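Forward excitation can be sketched as a score boost applied to any candidate that appears in the reference text set. The candidate strings, scores, and boost value below are illustrative assumptions, not values from the patent.

```python
def rescore(candidates, reference_set, boost=2.0):
    """Choose the final result after forward-exciting reference texts.

    candidates maps each candidate recognition result to its decoder score;
    candidates found in the reference set receive a positive score boost.
    """
    def boosted(text):
        return candidates[text] + (boost if text in reference_set else 0.0)
    return max(candidates, key=boosted)
```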
In another alternative, a speech recognition model may be used to recognize a speech to be recognized first, so as to obtain a preliminary speech recognition result. Furthermore, the preliminary voice recognition result is corrected by using the reference texts in the reference text set, so that a corrected voice recognition result is obtained.
It will be appreciated that the preliminary speech recognition result may contain erroneous recognized text since it is not aided by using any reference text. Therefore, the preliminary speech recognition result can be corrected using the reference text in the reference text set.
An optional correction method is to match the reference text against the preliminary speech recognition result, determine whether the preliminary result contains a target text unit whose matching degree with the reference text exceeds a set threshold, and if so, replace that target text unit with the reference text.
Still illustrated with the above example scenario:
The reference text set contains: "Luoji Siwei". The text corresponding to the speech to be recognized is: "Details of ticket and performance information for the 2020 Luoji Siwei New Year's Eve speech by Luo Zhenyu".

Because "logical thinking" occurs more frequently than "Luoji Siwei", its score is higher, and the preliminary recognition result obtained is "Details of ticket and performance information for the 2020 logical thinking New Year's Eve speech by Luo Zhenyu", which is obviously not the correct recognition result.

On this basis, the embodiment further corrects the preliminary recognition result using the reference text "Luoji Siwei": the reference text is matched against the preliminary result, and the matching degree between "logical thinking" in the preliminary result and the reference text "Luoji Siwei" is found to reach 75% (three of the four Chinese characters match), exceeding the set threshold. "Logical thinking" in the preliminary result is therefore directly replaced with the reference text "Luoji Siwei", and the final recognition result is obtained as: "Details of ticket and performance information for the 2020 Luoji Siwei New Year's Eve speech by Luo Zhenyu". Obviously, the final recognition result is correct.
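The matching-and-replacement correction can be sketched with `difflib.SequenceMatcher` as an assumed similarity measure; the patent does not specify how the matching degree is computed, and whitespace tokens stand in for real text units.

```python
from difflib import SequenceMatcher

def correct_result(preliminary, reference_texts, threshold=0.7):
    """Replace text units whose similarity to a reference text exceeds the threshold."""
    tokens = preliminary.split()
    for i, token in enumerate(tokens):
        for ref in reference_texts:
            if SequenceMatcher(None, token, ref).ratio() >= threshold:
                tokens[i] = ref  # replace the target text unit with the reference text
    return " ".join(tokens)
```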
Of course, the above only illustrates an alternative embodiment of correcting the preliminary speech recognition result by using the reference text set, and other ways may be used to implement the correction.
The following describes a speech recognition apparatus provided in an embodiment of the present application; the speech recognition apparatus described below and the speech recognition method described above may be referred to in correspondence with each other.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a speech recognition apparatus disclosed in the embodiment of the present application.
As shown in fig. 5, the apparatus may include:
a material data acquisition unit 11 for acquiring material data relating to a voice to be recognized;
a content information determination unit 12 for determining content information contained in the material data;
a preliminary reference text determining unit 13, configured to determine a preliminary reference text corresponding to the material data based on at least a semantic feature of the content information and a visual representation feature of the content information in the material data;
a reference text set determining unit 14, configured to determine, based on the preliminary reference text, a reference text set corresponding to the speech to be recognized;
and the voice recognition unit 15 is configured to perform voice recognition on the voice to be recognized by combining the reference text set, so as to obtain a voice recognition result.
Optionally, the preliminary reference text determining unit 13 may include:
the first preliminary reference text determining subunit is used for determining a preliminary reference text corresponding to the material data based on the semantic features of the content information and the visual expression features of the content information in the material data;
or, alternatively,
and the second preliminary reference text determining subunit is used for determining a preliminary reference text corresponding to the material data based on the semantic features of the content information, the visual expression features of the content information in the material data and the attribute features of the material data.
Optionally, the first preliminary reference text determining subunit may include:
a first feature determination unit, configured to determine semantic features of the content information and visual expression features of the content information in the material data;
the first model determining unit is used for inputting the semantic features and the visual expression features into a configured key information determining model to obtain key information output by the model, and the key information is used as a preliminary reference text corresponding to the material data, wherein:
the key information determination model is obtained by taking semantic features of content information contained in training data and visual expression features of the content information in the training data as training samples and taking labeled key information corresponding to the training data as sample labels for training.
Optionally, the process of determining the semantic features of the content information by the first feature determining unit may include:
if the content information is text information, performing word segmentation on the text information, and determining a word vector of each word segmentation as a semantic feature of the word segmentation;
and if the content information is multimedia information, searching relevant text information by taking the multimedia information as a searching condition, segmenting the relevant text information, and determining a word vector of each segmented word as the semantic feature of the segmented word.
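The word-vector lookup above can be sketched as follows, with a toy embedding table standing in for a trained word-vector model; the table contents, the 2-dimensional vectors, and the zero-vector fallback for unknown words are all assumptions:

```python
import numpy as np

# Toy embedding table; in practice these vectors would come from a
# trained word-vector model (e.g. word2vec or GloVe).
EMBED = {
    "new":  np.array([0.1, 0.3]),
    "year": np.array([0.2, 0.1]),
    "talk": np.array([0.4, 0.0]),
}
UNK = np.zeros(2)  # fallback for out-of-vocabulary tokens

def semantic_features(tokens):
    """Map each token of the (already segmented) content information
    to its word vector, used as that token's semantic feature."""
    return [EMBED.get(t, UNK) for t in tokens]
```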
Optionally, the process of determining the visual performance characteristics of the content information in the material data by the first characteristic determining unit may include:
if the content information is text information, performing word segmentation on the text information, determining an image area where each word is located in the material data, and determining the visual expression characteristic of each word based on the image area;
and if the content information is multimedia information, determining an image area of the multimedia information in the material data, and determining the visual expression characteristics of the multimedia information based on the image area.
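One plausible encoding of such visual expression features from a word's image area is sketched below. The specific feature set (normalized position, normalized size, relative font size, a bold flag) is an assumption; the embodiment only requires that the format, layout, and position of the content in the material data be captured:

```python
def visual_features(box, page_w, page_h, font_size, is_bold):
    """Encode how a text unit appears on the material: where it sits,
    how large it is, and simple style attributes."""
    x, y, w, h = box  # image area of the text unit, in pixels
    return [
        x / page_w, y / page_h,   # normalized position on the page
        w / page_w, h / page_h,   # normalized width and height
        font_size / page_h,       # font size relative to page height
        1.0 if is_bold else 0.0,  # style flag
    ]
```

A prominently placed, large, bold title would yield features that let the downstream model mark it as key information.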
Optionally, the key information determination model may include a feature splicing layer and a classification discrimination layer. Based on this, the process of inputting the semantic features and the visual expression features into the configured key information determination model by the first model determination unit to obtain the key information output by the model may include:
splicing the semantic features and the visual expression features by using the feature splicing layer to obtain spliced features;
and judging, by using the classification discrimination layer and based on the spliced features, whether the corresponding content information is key information, and outputting the key information so judged.
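The splice-then-discriminate flow can be sketched with a single logistic unit. The weights here are random placeholders standing in for the trained key information determination model, so the decision itself is meaningless; only the structure (feature splicing layer followed by a classification discrimination layer) reflects the description:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8,))  # untrained classifier weights (placeholder)
b = 0.0

def is_key_information(semantic_vec, visual_vec):
    """Feature splicing layer: concatenate semantic and visual features.
    Classification discrimination layer: a logistic unit deciding
    whether the content information is key information."""
    spliced = np.concatenate([semantic_vec, visual_vec])  # splicing layer
    score = 1.0 / (1.0 + np.exp(-(spliced @ W + b)))      # sigmoid score
    return score > 0.5
```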
Optionally, the second preliminary reference text determining subunit may include:
a second feature determination unit, configured to determine semantic features of the content information, visual expression features of the content information in the material data, and attribute features of the material data;
a second model determining unit, configured to input the semantic features, the visual expression features and the attribute features into the configured key information determination model to obtain key information output by the model, the key information being used as the preliminary reference text corresponding to the material data, wherein:
the key information determination model is obtained by taking semantic features of content information contained in training data, visual expression features of the content information in the training data and attribute features of the training data as training samples and taking labeled key information corresponding to the training data as sample labels for training.
Optionally, the key information determination model may include a feature splicing layer and a classification discrimination layer. Based on this, the process of the second model determining unit inputting the semantic features, the visual expression features and the attribute features into the configured key information determination model to obtain the key information output by the model may include:
splicing the semantic features, the visual expression features and the attribute features by using the feature splicing layer to obtain spliced features;
and judging, by using the classification discrimination layer and based on the spliced features, whether the corresponding content information is key information, and outputting the key information so judged.
Optionally, the reference text set determining unit 14 may include:
a first reference text set determining subunit, configured to combine the preliminary reference texts into a reference text set corresponding to the speech to be recognized;
or, alternatively,
and the second reference text set determining subunit is used for performing knowledge graph expansion on the basis of the preliminary reference text to obtain an expanded reference text, and the preliminary reference text and the expanded reference text form a reference text set corresponding to the voice to be recognized.
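The knowledge-graph expansion can be sketched as follows, using a toy entity graph; the entities and relations shown are assumptions for illustration, and a real embodiment would query a full knowledge graph for neighbors of each preliminary reference text:

```python
# Toy knowledge graph: entity -> related entities.
KG = {
    "Luoji Siwei": ["Luo Zhenyu", "New Year's Eve speech"],
    "Luo Zhenyu": ["Luoji Siwei"],
}

def expand_references(preliminary_refs):
    """Form the reference text set: the preliminary reference texts
    plus their knowledge-graph neighbors, de-duplicated while
    preserving order."""
    ref_set = list(preliminary_refs)
    for text in preliminary_refs:
        for related in KG.get(text, []):
            if related not in ref_set:
                ref_set.append(related)
    return ref_set
```

Expanding "Luoji Siwei" in this way also surfaces related names that the speaker is likely to mention, such as the host's name.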
Optionally, the voice recognition unit 15 may include:
the first voice recognition subunit is used for recognizing the voice to be recognized by using a voice recognition model and carrying out forward excitation on the reference texts in the reference text set in the recognition process to obtain a voice recognition result;
or, alternatively,
the second voice recognition subunit is used for recognizing the voice to be recognized by using a voice recognition model to obtain a preliminary voice recognition result; and correcting the preliminary voice recognition result by using the reference texts in the reference text set to obtain a corrected voice recognition result.
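Both subunits bias the result toward the reference texts. The forward-excitation variant can be sketched as hypothesis rescoring, where each decoding hypothesis containing a reference text receives a score bonus; the bonus value and the n-best-list representation are assumptions, as real decoders apply the excitation inside the search rather than after it:

```python
def rescore(hypotheses, reference_set, bonus=2.0):
    """Pick the best hypothesis after forward excitation: each
    (text, score) hypothesis gains `bonus` per reference text it
    contains, so hypotheses consistent with the material data can
    overtake more frequent homophones."""
    def boosted(hyp_score):
        hyp, score = hyp_score
        return score + bonus * sum(ref in hyp for ref in reference_set)
    return max(hypotheses, key=boosted)
```

With hypotheses `[("logical thinking talk", 5.0), ("Luoji Siwei talk", 4.5)]` and reference set `["Luoji Siwei"]`, the excitation lifts the second hypothesis above the first.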
Optionally, the material data may include material data in a picture format and/or a non-picture format. Based on this, the process of determining the content information included in the material data by the content information determination unit 12 may include:
and if the material data is in a picture format, performing OCR (optical character recognition) on the material data in the picture format to obtain content information contained in the material data.
The voice recognition device provided by the embodiments of the present application can be applied to voice recognition equipment such as mobile phones and computers. Optionally, fig. 6 shows a block diagram of the hardware structure of the voice recognition device. Referring to fig. 6, the hardware structure may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiments of the present application, the number of each of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 communicate with one another through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
the memory 3 may include high-speed RAM, and may further include non-volatile memory, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring material data related to the voice to be recognized;
determining content information contained in the material data, and determining a preliminary reference text corresponding to the material data at least based on semantic features of the content information and visual expression features of the content information in the material data;
determining a reference text set corresponding to the voice to be recognized based on the preliminary reference text;
and carrying out voice recognition on the voice to be recognized by combining the reference text set to obtain a voice recognition result.
Optionally, the detailed functions and extended functions of the program may be as described above.
Embodiments of the present application further provide a readable storage medium, where a program suitable for being executed by a processor may be stored, where the program is configured to:
acquiring material data related to the voice to be recognized;
determining content information contained in the material data, and determining a preliminary reference text corresponding to the material data at least based on semantic features of the content information and visual expression features of the content information in the material data;
determining a reference text set corresponding to the voice to be recognized based on the preliminary reference text;
and carrying out voice recognition on the voice to be recognized by combining the reference text set to obtain a voice recognition result.
Optionally, the detailed functions and extended functions of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (15)

1. A speech recognition method, comprising:
acquiring material data related to the voice to be recognized;
determining content information contained in the material data, and determining a preliminary reference text corresponding to the material data at least based on semantic features of the content information and visual expression features of the content information in the material data;
determining a reference text set corresponding to the voice to be recognized based on the preliminary reference text;
and carrying out voice recognition on the voice to be recognized by combining the reference text set to obtain a voice recognition result.
2. The method according to claim 1, wherein the determining a preliminary reference text corresponding to the material data based on at least the semantic features of the content information and the visual representation features of the content information in the material data comprises:
determining a preliminary reference text corresponding to the material data based on the semantic features of the content information and the visual expression features of the content information in the material data;
or, alternatively,
and determining a preliminary reference text corresponding to the material data based on the semantic features of the content information, the visual representation features of the content information in the material data and the attribute features of the material data.
3. The method of claim 2,
the visual representation characteristics of the content information in the material data include any one or combination of more of the following:
the format, layout and position of the content information in the material data;
the attribute characteristics of the material data include any one or a combination of:
type of material data, style of material data, and authorship information of material data.
4. The method according to claim 2, wherein the determining a preliminary reference text corresponding to the material data based on the semantic features of the content information and the visual representation features of the content information in the material data comprises:
determining semantic features of the content information and visual representation features of the content information in the material data;
inputting the semantic features and the visual expression features into a configured key information determination model to obtain key information output by the model, the key information being used as the preliminary reference text corresponding to the material data, wherein:
the key information determination model is obtained by taking semantic features of content information contained in training data and visual expression features of the content information in the training data as training samples and taking labeled key information corresponding to the training data as sample labels for training.
5. The method of claim 4, wherein determining the semantic features of the content information comprises:
if the content information is text information, performing word segmentation on the text information, and determining a word vector of each word segmentation as a semantic feature of the word segmentation;
and if the content information is multimedia information, searching relevant text information by taking the multimedia information as a searching condition, segmenting the relevant text information, and determining a word vector of each segmented word as the semantic feature of the segmented word.
6. The method of claim 4, wherein determining the visual representation of the content information in the material data comprises:
if the content information is text information, performing word segmentation on the text information, determining an image area where each word is located in the material data, and determining the visual expression characteristic of each word based on the image area;
and if the content information is multimedia information, determining an image area of the multimedia information in the material data, and determining the visual expression characteristics of the multimedia information based on the image area.
7. The method of claim 4, wherein the key information determination model comprises a feature concatenation layer and a classification discrimination layer;
the step of inputting the semantic features and the visual expression features into a configured key information determination model to obtain key information output by the model comprises the following steps:
splicing the semantic features and the visual expression features by using the feature splicing layer to obtain spliced features;
and judging, by using the classification discrimination layer and based on the spliced features, whether the corresponding content information is key information, and outputting the key information so judged.
8. The method according to claim 2, wherein the determining a preliminary reference text corresponding to the material data based on the semantic features of the content information, the visual representation features of the content information in the material data, and the attribute features of the material data comprises:
determining semantic features of the content information, visual representation features of the content information in the material data and attribute features of the material data;
inputting the semantic features, the visual expression features and the attribute features into a configured key information determination model to obtain key information output by the model, the key information being used as the preliminary reference text corresponding to the material data, wherein:
the key information determination model is obtained by taking semantic features of content information contained in training data, visual expression features of the content information in the training data and attribute features of the training data as training samples and taking labeled key information corresponding to the training data as sample labels for training.
9. The method of claim 8, wherein the key information determination model comprises a feature concatenation layer and a classification discrimination layer;
the step of inputting the semantic features, the visual expression features and the attribute features into the configured key information determination model to obtain the key information output by the model comprises:
splicing the semantic features, the visual expression features and the attribute features by using the feature splicing layer to obtain spliced features;
and judging, by using the classification discrimination layer and based on the spliced features, whether the corresponding content information is key information, and outputting the key information so judged.
10. The method according to claim 1, wherein the determining the reference text set corresponding to the speech to be recognized based on the preliminary reference text comprises:
combining the preliminary reference texts into a reference text set corresponding to the speech to be recognized;
or, alternatively,
and performing knowledge graph expansion on the basis of the preliminary reference text to obtain an expanded reference text, and forming a reference text set corresponding to the voice to be recognized by using the preliminary reference text and the expanded reference text.
11. The method according to claim 1, wherein performing speech recognition on the speech to be recognized in combination with the reference text set to obtain a speech recognition result comprises:
recognizing the voice to be recognized by using a voice recognition model, and carrying out forward excitation on the reference text in the reference text set in the recognition process to obtain a voice recognition result;
or, alternatively,
recognizing the voice to be recognized by using a voice recognition model to obtain a preliminary voice recognition result;
and correcting the preliminary voice recognition result by using the reference texts in the reference text set to obtain a corrected voice recognition result.
12. The method of claim 1, wherein the material data comprises material data in a picture format and/or a non-picture format;
the determining content information contained in the material data includes:
and if the material data is in a picture format, performing OCR (optical character recognition) on the material data in the picture format to obtain content information contained in the material data.
13. A speech recognition apparatus, comprising:
a material data acquisition unit for acquiring material data related to a voice to be recognized;
a content information determination unit configured to determine content information included in the material data;
a preliminary reference text determining unit, configured to determine a preliminary reference text corresponding to the material data based on at least a semantic feature of the content information and a visual expression feature of the content information in the material data;
a reference text set determining unit, configured to determine, based on the preliminary reference text, a reference text set corresponding to the speech to be recognized;
and the voice recognition unit is used for carrying out voice recognition on the voice to be recognized by combining the reference text set to obtain a voice recognition result.
14. A speech recognition device, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the speech recognition method according to any one of claims 1 to 12.
15. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the speech recognition method according to any one of claims 1 to 12.
CN202010058833.9A 2020-01-19 2020-01-19 Voice recognition method, device, equipment and readable storage medium Active CN111276149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010058833.9A CN111276149B (en) 2020-01-19 2020-01-19 Voice recognition method, device, equipment and readable storage medium


Publications (2)

Publication Number Publication Date
CN111276149A true CN111276149A (en) 2020-06-12
CN111276149B CN111276149B (en) 2023-04-18

Family

ID=71002005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010058833.9A Active CN111276149B (en) 2020-01-19 2020-01-19 Voice recognition method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111276149B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112382295A (en) * 2020-11-13 2021-02-19 安徽听见科技有限公司 Voice recognition method, device, equipment and readable storage medium
CN112733505A (en) * 2020-12-30 2021-04-30 科大讯飞股份有限公司 Document generation method and device, electronic equipment and storage medium
CN113077789A (en) * 2021-03-29 2021-07-06 南北联合信息科技有限公司 Voice real-time conversion method, system, computer equipment and storage medium
CN113421551A (en) * 2020-11-16 2021-09-21 腾讯科技(深圳)有限公司 Voice recognition method and device, computer readable medium and electronic equipment
CN116246633A (en) * 2023-05-12 2023-06-09 深圳市宏辉智通科技有限公司 Wireless intelligent Internet of things conference system

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102339193A (en) * 2010-07-21 2012-02-01 Tcl集团股份有限公司 Voice control conference speed method and system
CN104933029A (en) * 2015-06-23 2015-09-23 天津大学 Text image joint semantics analysis method based on probability theme model
US20160225371A1 (en) * 2015-01-30 2016-08-04 Google Technology Holdings LLC Dynamic inference of voice command for software operation from help information
CN105849720A (en) * 2013-11-30 2016-08-10 北京市商汤科技开发有限公司 Visual semantic complex network and method for forming network
CN105976818A (en) * 2016-04-26 2016-09-28 Tcl集团股份有限公司 Instruction identification processing method and apparatus thereof
CN106328147A (en) * 2016-08-31 2017-01-11 中国科学技术大学 Speech recognition method and device
CN108764007A (en) * 2018-02-10 2018-11-06 集智学园(北京)科技有限公司 Based on OCR with text analysis technique to the measurement method of attention
CN110188779A (en) * 2019-06-03 2019-08-30 中国矿业大学 A kind of generation method of image, semantic description
CN110288985A (en) * 2019-06-28 2019-09-27 北京猎户星空科技有限公司 Voice data processing method, device, electronic equipment and storage medium
CN110517689A (en) * 2019-08-28 2019-11-29 腾讯科技(深圳)有限公司 A kind of voice data processing method, device and storage medium
CN110544477A (en) * 2019-09-29 2019-12-06 北京声智科技有限公司 Voice recognition method, device, equipment and medium
CN110598739A (en) * 2019-08-07 2019-12-20 广州视源电子科技股份有限公司 Image-text conversion method, device, intelligent interaction method, device, system, client, server, machine and medium
CN110619876A (en) * 2019-09-26 2019-12-27 山东鲁能软件技术有限公司 Voice processing method and device based on power transmission mobile application



Also Published As

Publication number Publication date
CN111276149B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN111276149B (en) Voice recognition method, device, equipment and readable storage medium
JP6904449B2 (en) Generate a meeting review document that contains links to one or more documents that have been reviewed
US11270060B2 (en) Generating suggested document edits from recorded media using artificial intelligence
US11080466B2 (en) Updating existing content suggestion to include suggestions from recorded media using artificial intelligence
US11263384B2 (en) Generating document edit requests for electronic documents managed by a third-party document management service using artificial intelligence
WO2021051521A1 (en) Response information obtaining method and apparatus, computer device, and storage medium
CN108628828B (en) Combined extraction method based on self-attention viewpoint and holder thereof
US11720741B2 (en) Artificial intelligence assisted review of electronic documents
US11392754B2 (en) Artificial intelligence assisted review of physical documents
US9195646B2 (en) Training data generation apparatus, characteristic expression extraction system, training data generation method, and computer-readable storage medium
CN111046133A (en) Question-answering method, question-answering equipment, storage medium and device based on atlas knowledge base
CN109033060B (en) Information alignment method, device, equipment and readable storage medium
CN112087656A (en) Online note generation method and device and electronic equipment
CN107679070B (en) Intelligent reading recommendation method and device and electronic equipment
CN113254574A (en) Method, device and system for auxiliary generation of customs official documents
CN112382295A (en) Voice recognition method, device, equipment and readable storage medium
CN108121987B (en) Information processing method and electronic equipment
CN111259645A (en) Referee document structuring method and device
CN109634436B (en) Method, device, equipment and readable storage medium for associating input method
CN112699671B (en) Language labeling method, device, computer equipment and storage medium
CN111274354B (en) Referee document structuring method and referee document structuring device
WO2023124647A1 (en) Summary determination method and related device thereof
US20210374189A1 (en) Document search device, document search program, and document search method
JP4239850B2 (en) Video keyword extraction method, apparatus and program
CN111241276A (en) Topic searching method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant