CN113299277A - Voice semantic recognition method and system - Google Patents


Info

Publication number
CN113299277A
Authority
CN
China
Prior art keywords
semantic
voice
recognition
speech
semantic recognition
Prior art date
Legal status
Withdrawn
Application number
CN202110621932.8A
Other languages
Chinese (zh)
Inventor
姚娟娟
樊代明
钟南山
Current Assignee
Mingpinyun Beijing Data Technology Co Ltd
Original Assignee
Mingpinyun Beijing Data Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Mingpinyun Beijing Data Technology Co Ltd
Priority to CN202110621932.8A
Publication of CN113299277A
Legal status: Withdrawn


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a voice semantic recognition method and system. The method comprises: collecting a training sample set that contains a voice sample group and annotation information, the voice sample group consisting of Mandarin samples and dialect samples with the same voice content; inputting the training sample set into a semantic recognition network comprising a speech recognition sub-network, a long short-term memory sub-network for obtaining a first semantic label, and a convolutional neural sub-network for obtaining a second semantic label; training the semantic recognition network according to the first and second semantic labels to obtain a semantic recognition model; and inputting the speech to be recognized into the semantic recognition model for semantic recognition. The method improves the accuracy of voice recognition, offers a higher recognition speed, and enables accurate recognition of voice semantics, effectively improving the accuracy of voice semantic recognition.

Description

Voice semantic recognition method and system
Technical Field
The invention relates to the field of machine recognition, and in particular to a speech semantic recognition method and system.
Background
To meet users' voice recognition needs, speech recognition technology has developed rapidly. Because speech comes in many dialect types, a spoken input is generally matched against multiple dialect databases in turn to complete recognition; the many matching passes impose a heavy computational load and slow down recognition. Furthermore, when speech semantic recognition is required, semantics are currently determined by matching keywords in the speech. This keyword-matching approach does not analyze the context information in the speech, so its semantic recognition accuracy is low and the user experience suffers.
Disclosure of Invention
The invention provides a voice semantic recognition method and system to solve the prior-art problems of numerous matching passes, heavy computational load, slow recognition, and low semantic recognition accuracy when recognizing speech.
The speech semantic recognition method provided by the invention comprises the following steps:
acquiring a training sample set, the training sample set comprising: a speech sample set and annotation information, the speech sample set comprising: mandarin samples and dialect samples with the same voice content;
inputting the training sample set into a semantic recognition network, the semantic recognition network comprising: a speech recognition subnetwork, a long-short term memory subnetwork for obtaining a first semantic label, and a convolutional neural subnetwork for obtaining a second semantic label;
training a semantic recognition network according to the first semantic label and the second semantic label to obtain a semantic recognition model;
and inputting the speech to be recognized into the semantic recognition model for semantic recognition, thereby completing voice semantic recognition.
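The four steps above can be sketched as a minimal pipeline. All class and function names below are illustrative assumptions, not from the patent, and the stub lambdas merely stand in for trained sub-networks:

```python
class SemanticRecognitionNetwork:
    """Speech-recognition sub-network feeding two semantic-label branches."""

    def __init__(self, speech_subnet, lstm_subnet, cnn_subnet):
        self.speech_subnet = speech_subnet  # speech -> voice text
        self.lstm_subnet = lstm_subnet      # voice text -> first semantic label
        self.cnn_subnet = cnn_subnet        # voice text -> second semantic label

    def forward(self, speech):
        text = self.speech_subnet(speech)
        return self.lstm_subnet(text), self.cnn_subnet(text)


def recognize(network, speech):
    """Run both label branches and return their outputs."""
    first_label, second_label = network.forward(speech)
    return first_label, second_label


# Toy usage with stub callables standing in for trained sub-networks.
net = SemanticRecognitionNetwork(
    speech_subnet=lambda s: s.lower(),        # pretend speech-to-text
    lstm_subnet=lambda t: ("greeting", 0.9),  # pretend LSTM branch
    cnn_subnet=lambda t: ("greeting", 0.8),   # pretend CNN branch
)
print(recognize(net, "HELLO"))
```

In the trained system the two branches would be fused as described below; here they simply return their labels side by side.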
Optionally, the mandarin sample and the dialect sample with the same voice content have a first association relationship;
inputting the training sample set into a voice recognition sub-network in the semantic recognition network for voice feature extraction to obtain voice features;
classifying and labeling the voice features, and determining voice feature categories, wherein the voice feature categories comprise: mandarin and dialect;
determining a second association relationship between the voice features of different classes according to the first association relationship and the voice feature classes;
and acquiring a voice text by using the second association relation to complete voice recognition.
Optionally, attribute information in the mandarin sample or the dialect sample is obtained, where the attribute information at least includes one of: region information and identity information;
determining one or more association types of the Mandarin sample or the dialect sample according to the regional information and/or the identity information;
and inputting the voice features of the Mandarin sample or the dialect sample into a voice feature library of a corresponding type according to the association type, and performing feature matching to obtain a voice text.
Optionally, the step of inputting the speech features of the mandarin chinese sample or the dialect sample into the speech feature library of the corresponding type according to the association type includes:
acquiring weights corresponding to the plurality of association types according to a preset weight distribution rule;
acquiring a matching sequence of the voice features and the voice feature libraries of different types according to weights corresponding to the plurality of association types;
and according to the matching sequence, sequentially inputting the voice features into a corresponding voice feature library for feature matching to obtain a voice text.
Optionally, performing word segmentation processing on the voice text output by the voice recognition sub-network to obtain one or more word segmentation vocabularies;
acquiring the word frequency and the reverse file frequency of the word segmentation vocabulary;
determining vocabulary scores of the word segmentation vocabularies according to the word frequency and the reverse file frequency;
performing truncation filtering on the word-segmentation vocabulary according to the vocabulary scores and a preset score threshold to obtain a noise-reduced vocabulary;
inputting the noise reduction vocabulary into the long-short term memory sub-network, and extracting semantic features according to context information to obtain semantic feature vectors;
and acquiring a first semantic label of the voice text according to the semantic feature vector.
Optionally, inputting the noise-reduction vocabulary into the convolutional neural subnetwork to perform feature extraction, and acquiring vocabulary features;
vectorizing the vocabulary characteristics to obtain vocabulary characteristic vectors;
labeling and classifying the vocabulary feature vectors to obtain class labels of the vocabulary feature vectors;
inputting the vocabulary feature vectors into a semantic library for feature vector matching according to the category labels to obtain matching results;
and acquiring the second semantic label according to the matching result.
Optionally, the step of obtaining the second semantic label according to the matching result includes:
presetting a semantic library, wherein the semantic library comprises: a reference feature vector and a reference semantic tag, the reference feature vector being associated with the reference semantic tag;
classifying the reference feature vectors to obtain one or more reference feature vector sets, wherein the reference feature vector sets are different in category;
matching the category label with the reference feature vector set to obtain a corresponding reference feature vector set;
matching the vocabulary feature vectors with the reference feature vectors in the corresponding reference feature vector set to obtain a first matching degree;
and when the first matching degree is greater than or equal to a preset first matching degree threshold value, determining a reference semantic label associated with the reference feature vector as the second semantic label.
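The set-filtered matching with a first-matching-degree threshold might look as follows. Cosine similarity, the library layout, and the 0.8 threshold are all assumptions for illustration, not specified by the patent:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def second_label(vocab_vec, category, semantic_library, threshold=0.8):
    """Match a vocabulary feature vector only against the reference set whose
    category matches its class label; return the associated reference semantic
    label when the best first matching degree reaches the threshold."""
    best_label, best_score = None, 0.0
    for ref_vec, ref_label in semantic_library.get(category, []):
        score = cosine(vocab_vec, ref_vec)
        if score > best_score:
            best_label, best_score = ref_label, score
    return best_label if best_score >= threshold else None


# Hypothetical semantic library keyed by category label.
library = {
    "medical": [((1.0, 0.0, 0.0), "symptom"), ((0.0, 1.0, 0.0), "drug")],
}
print(second_label((0.9, 0.1, 0.0), "medical", library))
```

Pre-filtering by category label means only one reference set is scanned, which is the speed-up the patent attributes to the class labels.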
Optionally, the step of training the semantic recognition network according to the first semantic label and the second semantic label includes:
performing similarity matching on the first semantic label and the second semantic label to obtain a second matching degree;
when the second matching degree is larger than or equal to a preset second matching degree threshold value, determining the first semantic label or the second semantic label as a semantic recognition result;
when the second matching degree is smaller than the second matching degree threshold value, determining the semantic label with higher priority as a semantic recognition result according to a preset priority rule;
and training the semantic recognition network according to the semantic recognition result to obtain a semantic recognition model.
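A minimal sketch of this decision rule, assuming the second matching degree is computed elsewhere and that the preset priority rule simply names which branch wins on disagreement (both the threshold and the priority encoding are illustrative assumptions):

```python
def resolve_labels(first_label, second_label, second_matching_degree,
                   threshold=0.8, priority=("first", "second")):
    """If the similarity of the two labels (the second matching degree)
    reaches the threshold, either label may serve as the recognition result;
    otherwise fall back to the label ranked higher by the priority rule."""
    if second_matching_degree >= threshold:
        return first_label  # labels agree closely enough; either one works
    return first_label if priority[0] == "first" else second_label


# Labels disagree and the first branch has priority, so it wins.
print(resolve_labels("book_flight", "book_hotel", 0.3))
```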
Optionally, a related information base is set, where the related information base includes: text data and recommendation information, the recommendation information comprising: text information and voice information, the text data being associated with the recommendation information;
matching the semantic recognition result with the text data, and determining corresponding text data and corresponding recommendation information;
and recommending the associated information according to the recommendation information.
The invention also provides a speech semantic recognition system, comprising:
an acquisition module for acquiring a training sample set, the training sample set comprising: a speech sample set and annotation information, the speech sample set comprising: mandarin samples and dialect samples with the same voice content;
a model acquisition module configured to input the training sample set into a semantic recognition network, where the semantic recognition network includes: a speech recognition subnetwork, a long-short term memory subnetwork for obtaining a first semantic label, and a convolutional neural subnetwork for obtaining a second semantic label; training a semantic recognition network according to the first semantic label and the second semantic label to obtain a semantic recognition model;
the semantic recognition module is used for inputting the voice to be recognized into the semantic recognition model for semantic recognition to finish voice semantic recognition; the acquisition module, the model acquisition module and the semantic identification module are connected.
The invention has the following beneficial effects. By inputting a training sample set comprising Mandarin samples and dialect samples with the same voice content into a semantic recognition network for training, the speech semantic recognition method improves the accuracy of voice recognition and achieves a higher recognition speed. By obtaining a first semantic label and a second semantic label and using both to train the semantic recognition network, a semantic recognition model is obtained; inputting the speech to be recognized into this model for semantic recognition achieves accurate recognition of voice semantics and effectively improves the accuracy of voice semantic recognition.
Drawings
FIG. 1 is a flow chart of a speech semantic recognition method according to an embodiment of the present invention.
FIG. 2 is a flow chart illustrating speech recognition in the speech semantic recognition method according to an embodiment of the present invention.
Fig. 3 is a schematic flow chart of the first semantic tag extraction in the speech semantic recognition method according to the embodiment of the present invention.
Fig. 4 is a schematic flow chart of the second semantic tag extraction in the speech semantic recognition method according to the embodiment of the present invention.
FIG. 5 is a schematic flow chart of training a semantic recognition network in the speech semantic recognition method according to the embodiment of the present invention.
FIG. 6 is a schematic structural diagram of a speech semantic recognition system according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
The inventor found that, with the development of machine recognition, speech recognition technology has advanced rapidly. Because speech comes in many dialect types, a spoken input is generally matched against multiple dialect databases in turn to complete recognition; the many matching passes impose a heavy computational load and slow down recognition. Furthermore, speech semantics are currently determined by matching keywords in the speech, an approach that does not analyze context information, so its semantic recognition accuracy is low and the user experience suffers. The invention therefore provides a speech semantic recognition method and system: a training sample set comprising Mandarin samples and dialect samples with the same speech content is input into a semantic recognition network for training, which improves the accuracy of speech recognition; the network is trained using a first semantic label and a second semantic label to obtain a semantic recognition model; and the speech to be recognized is input into that model for semantic recognition, achieving accurate recognition of speech semantics with high practicability and low cost.
As shown in fig. 1, the speech semantic recognition method in this embodiment includes:
s101: acquiring a training sample set, the training sample set comprising: a speech sample set and annotation information, the speech sample set comprising: mandarin samples and dialect samples with the same voice content; and providing a data basis for training the semantic recognition network by acquiring the training sample set. By collecting multiple groups of Mandarin samples and dialect samples, the recognition compatibility of the semantic recognition network to different types of voices is improved, and the accuracy of dialect recognition in the voice recognition process is improved.
S102: inputting the training sample set into a semantic recognition network, the semantic recognition network comprising: a speech recognition sub-network, a long short-term memory sub-network for obtaining a first semantic label, and a convolutional neural sub-network for obtaining a second semantic label. For example, the training sample set is input into the speech recognition sub-network to obtain a voice text, and the voice text is then input into the long short-term memory sub-network and the convolutional neural sub-network respectively to extract the first and second semantic labels. Extracting and fusing semantic labels through this multi-branch arrangement improves semantic recognition accuracy and is convenient to implement.
S103: training the semantic recognition network according to the first semantic label and the second semantic label to obtain a semantic recognition model. For example, a semantic recognition result is determined by similarity matching between the first and second semantic labels, the network is trained according to that result, and a better model is obtained by tuning the gradient-descent rate, the learning rate, and the numbers of learning and training iterations.
S104: acquiring a voice to be recognized; for example: voice uttered from the patient terminal or voice uttered from the doctor terminal is collected as voice to be recognized.
S105: inputting the speech to be recognized into the semantic recognition model for semantic recognition, completing voice semantic recognition. Performing semantic recognition with the trained model yields higher recognition accuracy and stronger practicability. For example, in the medical field, when a patient issues speech through a mobile terminal, that speech is input into the semantic recognition model to obtain a more accurate semantic recognition result, which helps a doctor quickly understand the patient's meaning and respond, or corresponding recommendation information is automatically matched according to the result and feedback is given based on it.
As shown in fig. 2, to improve the accuracy of the speech recognition sub-network, a Mandarin sample and a dialect sample having the same speech content are given a first association relationship. Because a dialect may differ considerably from Mandarin, establishing this relationship between samples with the same speech content facilitates training of the semantic recognition network. The speech recognition step then comprises:
s201: inputting the training sample set into a voice recognition sub-network in the semantic recognition network for voice feature extraction to obtain voice features;
in order to reduce sample noise in the speech recognition process, the inventor proposes that the step of inputting the training sample set into a speech recognition sub-network in the semantic recognition network for speech feature extraction includes:
filtering Mandarin samples and dialect samples in the training sample set to obtain noise reduction samples;
and inputting the noise-reduced samples into the speech recognition sub-network for voice feature extraction to obtain voice features. Filtering the Mandarin samples and dialect samples effectively reduces the noise in the samples. For example, the samples in the training set are filtered by wavelet transformation with a set threshold to obtain noise-reduced samples, lowering sample noise and improving the accuracy of speech recognition.
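As one illustration of wavelet-plus-threshold filtering, the following is a minimal one-level Haar decomposition with soft thresholding of the detail coefficients. This is a pure-Python sketch under stated assumptions: the threshold value is arbitrary, the signal length must be even, and a real system would use a multi-level wavelet library instead:

```python
import math


def haar_denoise(signal, threshold=0.5):
    """One-level Haar decomposition, soft-threshold the detail coefficients,
    then reconstruct. Signal length must be even."""
    s2 = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s2 for i in range(0, len(signal), 2)]
    # Soft threshold: shrink small (presumed noisy) detail coefficients to zero.
    detail = [math.copysign(max(abs(d) - threshold, 0.0), d) for d in detail]
    denoised = []
    for a, d in zip(approx, detail):
        denoised.append((a + d) / s2)  # inverse Haar step
        denoised.append((a - d) / s2)
    return denoised
```

With a zero threshold the transform round-trips the signal exactly; a large threshold flattens small local fluctuations while preserving the local averages.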
S202: classifying and labeling the voice features, and determining voice feature categories, wherein the voice feature categories comprise: mandarin and dialect;
S203: determining a second association relationship between voice features of different classes according to the first association relationship and the voice feature classes; that is, voice features of different classes but with the same content are associated. For example, a first voice feature and a second voice feature have a second association relationship when the two features differ, their classes (Mandarin or dialect) differ, and their content is the same, "same content" meaning that the voice texts corresponding to the two features are identical.
S204: acquiring the voice text by using the second association relationship to complete speech recognition. Establishing the second association relationship between different voice features improves the recognition capability of the speech recognition sub-network and thereby completes speech recognition.
In order to accelerate the matching speed of the voice characteristics in the voice recognition process, the inventor proposes: the step of obtaining the voice text comprises the following steps:
obtaining attribute information in the Mandarin sample or the dialect sample, wherein the attribute information at least comprises one of the following items: region information and identity information; the region information is the position information of the user terminal sending the Mandarin sample or the dialect sample, and the region information can be obtained in a GPS positioning mode; the identity information is the identity information of the user corresponding to the mandarin chinese sample or the dialect sample, and the corresponding identity information may be obtained from a user terminal, for example: the method comprises the steps of obtaining the native place of a user from a user terminal, such as Chongqing, Shandong and Beijing.
Determining one or more association types of the Mandarin sample or the dialect sample according to the regional information and/or the identity information; for example: when the position of the user terminal sending the Mandarin sample or the dialect sample is located at Chongqing, namely the region information is Chongqing, and the native in the identity information in the user terminal is Beijing, determining that the correlation type of the Mandarin sample or the dialect sample is Chongqing and Beijing;
and inputting the voice features of the Mandarin sample or the dialect sample into the voice feature library of the corresponding type according to the association type, and performing feature matching to obtain the voice text. Matching the voice features against the libraries of the corresponding types markedly speeds up feature matching, improving both the speed and the accuracy of speech recognition, and is convenient to implement. For example, when the association types of the Mandarin or dialect sample are Chongqing and Beijing, its voice features are matched against the Chongqing and Beijing voice feature libraries to obtain the voice text, achieving rapid recognition of the speech.
In order to further speed up the speech recognition, the inventors propose: according to the association type, the step of inputting the voice characteristics of the Mandarin sample or the dialect sample into the voice characteristic library of the corresponding type comprises the following steps:
acquiring weights corresponding to the plurality of association types according to a preset weight distribution rule; the weight distribution rule may be set according to an actual situation, for example, the weight of the association type corresponding to the region information is set as a first weight, the weight of the association type corresponding to the identity information is set as a second weight, and values of the first weight and the second weight may be set according to an actual requirement, which is not described herein again.
acquiring the matching order of the voice features against the different types of voice feature libraries according to the weights of the association types. For example, when the weight of the association type corresponding to the region information is greater than that of the identity information, the matching order is sorted by weight. For another example, when the region information gives the association type Shandong, the identity information gives Beijing, and the region weight is greater, the matching order is Shandong first, then Beijing: the voice features of the Mandarin or dialect sample are first matched against the Shandong voice feature library and then against the Beijing library. Setting the matching order in this way reduces the load of feature matching.
And according to the matching sequence, the voice features are sequentially input into a voice feature library of a corresponding type to perform feature matching, so that a voice text is obtained, and the voice recognition efficiency and accuracy are effectively improved.
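The weight-ordered matching can be sketched as follows. The weight values are illustrative assumptions, and an exact dictionary lookup stands in for real acoustic feature matching:

```python
def match_order(assoc_weights):
    """Sort association types (e.g. region vs. identity) by their preset
    weights, highest first, to obtain the library matching order."""
    return [t for t, _ in sorted(assoc_weights.items(),
                                 key=lambda kv: kv[1], reverse=True)]


def match_features(features, libraries, order):
    """Try each speech-feature library in the given order and stop at the
    first hit; `libraries` maps library type -> {feature: voice text}."""
    for lib_type in order:
        text = libraries.get(lib_type, {}).get(features)
        if text is not None:
            return text
    return None


# Region weight outranks identity weight, so Shandong is tried first.
order = match_order({"Shandong": 0.7, "Beijing": 0.3})
print(order)
```

Stopping at the first successful library is what keeps the matching load below the exhaustive all-databases approach criticized in the background section.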
In order to realize load balance between a terminal and a cloud in a voice recognition process, the inventor provides:
setting a voice feature library corresponding to the association type of the voice features in a terminal, and determining the residual voice feature library;
setting the cloud end for the residual voice feature library;
performing feature matching on the voice features and a corresponding voice feature library in the terminal to obtain a voice text;
and when the confidence of the voice text is smaller than a preset confidence threshold, controlling the terminal to call the cloud, inputting the voice features into the cloud, and performing feature matching with cloud resources to obtain the voice text. Deploying different types of voice feature libraries on the terminal and in the cloud respectively relieves the terminal's load and accelerates both matching and speech recognition.
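A minimal sketch of the terminal-first, cloud-fallback flow; the matcher callables, their return shape, and the confidence threshold are illustrative assumptions:

```python
def recognize_with_fallback(features, terminal_match, cloud_match,
                            confidence_threshold=0.8):
    """Match on the terminal first; call out to the cloud only when the
    local result's confidence falls below the preset threshold."""
    text, confidence = terminal_match(features)
    if confidence >= confidence_threshold:
        return text, "terminal"
    text, confidence = cloud_match(features)  # cloud-side feature matching
    return text, "cloud"


# Stub matchers: the terminal is unsure, so the cloud is consulted.
terminal = lambda f: ("ni hao", 0.4)
cloud = lambda f: ("ni hao (hello)", 0.95)
print(recognize_with_fallback("features", terminal, cloud))
```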
In some embodiments, when the remaining processing capacity of the terminal processing the voice recognition task is smaller than a preset processing capacity threshold, the voice feature of the mandarin sample or the dialect sample is input to the center cloud for feature matching, so as to obtain a voice text, thereby avoiding unnecessary loss due to insufficient processing capacity of the terminal.
Referring to fig. 3, in order to obtain the first semantic tag, the inventors propose that the step of obtaining the first semantic tag comprises:
s301: performing word segmentation processing on the voice text output by the voice recognition sub-network to obtain one or more word segmentation vocabularies;
s302: acquiring the word frequency and the reverse file frequency of the word segmentation vocabulary;
s303: determining vocabulary scores of the word segmentation vocabularies according to the word frequency and the reverse file frequency; for example: determining vocabulary scores of word segmentation vocabularies according to preset score statistical rules, word frequencies and reverse file frequencies, wherein the score statistical rules can be set according to actual conditions, such as obtaining the product of the word frequencies and the reverse file frequencies, and the like, and are not repeated herein.
S304: performing truncation filtering on the word-segmentation vocabulary according to the vocabulary scores and a preset score threshold to obtain a noise-reduced vocabulary. Truncating and filtering the word-segmentation vocabulary effectively reduces its noise. For example, words whose vocabulary scores are below the score threshold are removed, yielding the noise-reduced vocabulary.
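Steps S302 to S304 amount to TF-IDF scoring followed by threshold truncation, which can be sketched as follows. The score threshold, the product scoring rule, and the plain logarithmic IDF form are assumptions, since the patent leaves the statistical rule open:

```python
import math


def tfidf_filter(doc_words, corpus, score_threshold=0.05):
    """Score each word by term frequency x inverse document frequency and
    drop words scoring below the threshold (truncation filtering)."""
    n_docs = len(corpus)
    kept = []
    for word in set(doc_words):
        tf = doc_words.count(word) / len(doc_words)   # term (word) frequency
        df = sum(1 for doc in corpus if word in doc)  # document frequency
        idf = math.log(n_docs / max(df, 1))           # inverse document frequency
        if tf * idf >= score_threshold:               # vocabulary score cutoff
            kept.append(word)
    return sorted(kept)


# A word common to every document scores zero and is filtered out as noise.
print(tfidf_filter(["the", "the", "fever", "the"],
                   [["the", "fever"], ["the", "cat"], ["the", "dog"]]))
```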
S305: inputting the noise reduction vocabulary into the long-short term memory sub-network, and extracting semantic features according to context information to obtain semantic feature vectors;
s306: and acquiring a first semantic label of the voice text according to the semantic feature vector. The noise-reducing words are input into the long-term and short-term memory sub-network for semantic feature extraction, so that the accuracy of semantic feature vector extraction can be improved by combining context information. The long-short term memory subnetwork comprises: the forgetting gate, the input gate and the output gate are controlled to discard or retain the information in the network, for example: and determining discarded information and retained information by using a preset weight matrix and bias, and discarding the discarded information. The method comprises the steps of inputting a noise reduction vocabulary into a long-short term memory sub-network through an input gate, namely inputting a tanh layer, obtaining a semantic feature vector, updating the state of the long-short term memory sub-network, and determining a first semantic label of a voice text through a sigmoid layer according to the semantic feature vector.
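Steps S301 to S304 correspond to standard TF-IDF scoring followed by threshold truncation. A minimal Python sketch under that reading (the IDF smoothing and the score threshold are illustrative choices, not specified by the patent):

```python
import math
from collections import Counter

def tfidf_filter(doc_tokens, corpus, score_threshold):
    """Score each token by TF x IDF and keep only tokens whose score
    reaches the threshold -- the 'noise-reduction vocabulary'.

    doc_tokens: segmented tokens of the voice text.
    corpus: list of token lists used to compute document frequencies.
    """
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    kept = []
    for tok, count in tf.items():
        tf_val = count / len(doc_tokens)
        df = sum(1 for doc in corpus if tok in doc)
        idf = math.log(n_docs / (1 + df))      # +1 smoothing is an assumption
        score = tf_val * idf                   # the product rule from S303
        if score >= score_threshold:
            kept.append(tok)
    return kept
```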
As shown in fig. 4, in order to facilitate obtaining the second semantic tag and improve the accuracy of speech semantic recognition, the inventor proposes that the step of obtaining the second semantic tag includes:
S401: inputting the noise-reduction vocabulary into the convolutional neural sub-network for feature extraction to obtain vocabulary features;
S402: vectorizing the vocabulary features to obtain vocabulary feature vectors;
S403: labeling and classifying the vocabulary feature vectors to obtain class labels of the vocabulary feature vectors;
S404: inputting the vocabulary feature vectors into a semantic library for feature vector matching according to the class labels to obtain matching results; by acquiring the class labels of the vocabulary feature vectors and using them to direct the vocabulary feature vectors to the corresponding part of the semantic library, the matching speed of the feature vectors can be improved and the excessive matching time caused by matching the vectors one by one is avoided.
S405: and acquiring the second semantic label according to the matching result.
In some embodiments, the step of obtaining the second semantic label according to the matching result comprises:
presetting a semantic library, wherein the semantic library comprises: a reference feature vector and a reference semantic tag, the reference feature vector being associated with the reference semantic tag;
classifying the reference feature vectors to obtain one or more reference feature vector sets, wherein the reference feature vector sets are different in category;
matching the category label with the reference feature vector set to obtain a corresponding reference feature vector set;
matching the vocabulary characteristic vectors with the reference characteristic vectors in the corresponding reference characteristic vector set to obtain a first matching degree;
and when the first matching degree is greater than or equal to a preset first matching degree threshold value, determining a reference semantic label associated with the reference feature vector as the second semantic label.
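The class-label lookup of steps S404 and S405 and the first-matching-degree test can be sketched as follows; cosine similarity is assumed here as the matching degree, which the patent does not specify:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def match_second_label(vec, category, semantic_lib, threshold):
    """semantic_lib: {category: [(reference_vector, reference_label), ...]}.

    Only the reference feature vector set matching the class label is
    searched, which avoids scanning the whole library (the speed-up
    described for S404). Returns the reference semantic label whose
    first matching degree reaches the threshold, else None.
    """
    best_label, best_deg = None, 0.0
    for ref_vec, ref_label in semantic_lib.get(category, []):
        deg = cosine(vec, ref_vec)            # first matching degree
        if deg > best_deg:
            best_label, best_deg = ref_label, deg
    return best_label if best_deg >= threshold else None
```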
Referring to fig. 5, in order to improve the accuracy of speech semantic recognition, the inventor proposes that a first semantic tag and a second semantic tag are obtained by combining a speech recognition sub-network, a long-short term memory sub-network and a convolutional neural sub-network, and a semantic recognition network is trained according to the first semantic tag and the second semantic tag, wherein the training process comprises:
S501: performing similarity matching on the first semantic label and the second semantic label to obtain a second matching degree;
S502: when the second matching degree is greater than or equal to a preset second matching degree threshold, determining the first semantic label or the second semantic label as the semantic recognition result;
S503: when the second matching degree is smaller than the second matching degree threshold, determining the semantic label with the higher priority as the semantic recognition result according to a preset priority rule; the priority rule can be set according to the actual situation, for example preferring the first semantic label or preferring the second semantic label, which is not repeated herein.
S504: and training the semantic recognition network according to the semantic recognition result to obtain a semantic recognition model.
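Steps S501 to S503 reduce to a small fusion rule. A sketch, with the similarity computation left to the caller and the priority rule expressed as a flag (an assumption; the patent only says the rule may be set according to the actual situation):

```python
def fuse_labels(first, second, similarity, sim_threshold, prefer_first=True):
    """Combine the two semantic labels into one recognition result.

    similarity: the second matching degree between the two labels.
    When it reaches the threshold the labels agree and either may be
    returned; otherwise the priority rule decides.
    """
    if similarity >= sim_threshold:
        return first            # labels agree; either label is the result
    return first if prefer_first else second
```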
In order to improve the accuracy of training the semantic recognition network, the inventor proposes that the step of training the semantic recognition network according to the semantic recognition result comprises:
training a semantic recognition network by using a preset loss function according to the semantic recognition result to obtain a semantic recognition model, wherein the mathematical expression of the loss function is as follows:
Lz = α∑i[ft×log(fp) + (1-ft)×log(1-fp)] + β(…)∑i|ct-cp| + σ∑i k·log(…)
wherein Lz is the loss function, α is a preset first weight, ft is the true value of the first semantic label, fp is the predicted value of the first semantic label, β is a preset second weight, n is the number of samples, 1 ≤ i ≤ n, ct is the true value of the second semantic label, cp is the predicted value of the second semantic label, σ is a preset third weight, and k is the similarity between the predicted value and the true value of the semantic recognition result. The true value of the first semantic label is the same as the true value of the second semantic label. Feedback training of the semantic recognition network with this loss function effectively improves the accuracy of the parameters in the semantic recognition network, and thereby the accuracy of voice semantic recognition.
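A sketch of this loss in Python. Two factors of the published formula survive only as image placeholders; the sketch assumes a 1/n normalisation on the absolute-error term and log(1/k) inside the similarity term, and both assumptions are guesses:

```python
import math

def semantic_loss(f_true, f_pred, c_true, c_pred, k, alpha, beta, sigma):
    """Composite loss over n samples: a cross-entropy term on the first
    semantic label, an absolute-error term on the second label, and a
    similarity term weighted by alpha, beta and sigma respectively.

    The 1/n factor and the log(1/k) argument are assumptions standing in
    for terms lost in the patent's image placeholders.
    """
    n = len(f_true)
    ce = sum(ft * math.log(fp) + (1 - ft) * math.log(1 - fp)
             for ft, fp in zip(f_true, f_pred))
    mae = sum(abs(ct - cp) for ct, cp in zip(c_true, c_pred)) / n
    sim = sum(ki * math.log(1 / ki) for ki in k)
    return alpha * ce + beta * mae + sigma * sim
```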
For example: when receiving a voice to be recognized sent from a doctor terminal or a patient terminal, the voice to be recognized is input into the semantic recognition model for voice recognition, first semantic label extraction and second semantic label extraction; a semantic recognition result is acquired according to the similarity between the first semantic label and the second semantic label, i.e. the second matching degree, and output to complete voice semantic recognition. This effectively improves the accuracy of voice semantic recognition and helps a patient or doctor quickly grasp the meaning of the voice to be recognized, with high feasibility and convenient implementation.
In order to improve the use value of the speech semantic recognition, the inventor proposes that the step of completing the speech semantic recognition comprises the following steps:
setting a correlation information base, wherein the correlation information base comprises: text data and recommendation information, the recommendation information comprising: text information and voice information, the text data being associated with the recommendation information;
matching the semantic recognition result with the text data, and determining corresponding text data and corresponding recommendation information;
and recommending the associated information according to the recommendation information. For example: when the semantic recognition result is obtained, it is matched with the text data in the associated information base to acquire the matching text data and its associated recommendation information; the recommendation information can be a diagnosis and treatment scheme, a diagnosis and treatment suggestion or a diagnosis and treatment voice, etc., realizing intelligent recommendation based on the semantic recognition result with a high degree of automation and strong practicability.
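The associated-information lookup can be sketched as a simple match against the base; substring matching is used here as a stand-in for whatever matching criterion the patent intends:

```python
def recommend(semantic_result, info_base):
    """Return the recommendation associated with the first text data
    entry that matches the semantic recognition result.

    info_base: {text_data: recommendation_info}, e.g. a diagnosis and
    treatment suggestion. Substring matching is an illustrative choice.
    """
    for text, recommendation in info_base.items():
        if text in semantic_result or semantic_result in text:
            return recommendation
    return None
```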
As shown in fig. 6, the present embodiment further provides a speech semantic recognition system, including:
an acquisition module for acquiring a training sample set, the training sample set comprising: a speech sample set and annotation information, the speech sample set comprising: mandarin samples and dialect samples with the same voice content;
a model acquisition module configured to input the training sample set into a semantic recognition network, where the semantic recognition network includes: a speech recognition subnetwork, a long-short term memory subnetwork for obtaining a first semantic label, and a convolutional neural subnetwork for obtaining a second semantic label; training a semantic recognition network according to the first semantic label and the second semantic label to obtain a semantic recognition model;
the semantic recognition module is used for inputting the voice to be recognized into the semantic recognition model for semantic recognition to complete voice semantic recognition; the acquisition module, the model acquisition module and the semantic recognition module are connected. Inputting a training sample set comprising Mandarin samples and dialect samples with the same voice content into the semantic recognition network for training improves the accuracy of voice recognition; the semantic recognition network is trained with the acquired first semantic label and second semantic label to obtain the semantic recognition model, and the voice to be recognized is input into the semantic recognition model for semantic recognition, realizing accurate recognition of voice semantics and effectively improving the accuracy of voice semantic recognition, with strong feasibility and low cost.
The model acquisition module comprises: the device comprises a voice recognition unit, a first semantic label acquisition unit, a second semantic label acquisition unit and a training unit.
In some embodiments, the Mandarin sample and the dialect sample having the same voice content have a first association relationship;
the voice recognition unit is used for inputting the training sample set into a voice recognition sub-network in the semantic recognition network to perform voice feature extraction so as to obtain voice features;
classifying and labeling the voice features, and determining voice feature categories, wherein the voice feature categories comprise: mandarin and dialect;
determining a second incidence relation between the voice features of different classes according to the first incidence relation and the voice feature classes;
and acquiring a voice text by using the second association relation to complete voice recognition.
In some embodiments, the speech recognition unit obtains attribute information in the Mandarin sample or the dialect sample, the attribute information including at least one of: region information and identity information;
determining one or more association types of the Mandarin sample or the dialect sample according to the regional information and/or the identity information;
and inputting the voice features of the Mandarin sample or the dialect sample into a voice feature library of a corresponding type according to the association type, and performing feature matching to obtain a voice text.
In some embodiments, the step of inputting the speech features of the Mandarin sample or the dialect sample into the speech feature library of the corresponding type according to the association type includes:
acquiring weights corresponding to the plurality of association types according to a preset weight distribution rule;
acquiring a matching sequence of the voice features and the voice feature libraries of different types according to weights corresponding to the plurality of association types;
and according to the matching sequence, sequentially inputting the voice features into a corresponding voice feature library for feature matching to obtain a voice text.
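The weighted matching order of these steps can be sketched as follows (the weights, library contents and `match_fn` callable are illustrative):

```python
def matching_order(association_weights):
    """Given {association_type: weight}, return the library types in the
    order they should be tried: highest weight first."""
    return sorted(association_weights, key=association_weights.get, reverse=True)

def match_in_order(features, libraries, association_weights, match_fn):
    """Try each type's feature library in weighted order until a match
    is found. match_fn(features, library) returns a voice text or None."""
    for lib_type in matching_order(association_weights):
        text = match_fn(features, libraries[lib_type])
        if text is not None:
            return text
    return None
```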
In some embodiments, the first semantic tag obtaining unit is configured to perform word segmentation on a speech text output by the speech recognition subnetwork, so as to obtain one or more word segmentation vocabularies;
acquiring the term frequency and the inverse document frequency of the word segmentation vocabulary;
determining vocabulary scores of the word segmentation vocabularies according to the term frequency and the inverse document frequency;
performing truncation filtering on the word segmentation vocabularies according to the vocabulary scores and a preset score threshold to obtain the noise-reduction vocabulary;
inputting the noise reduction vocabulary into the long-short term memory sub-network, and extracting semantic features according to context information to obtain semantic feature vectors;
and acquiring a first semantic label of the voice text according to the semantic feature vector.
In some embodiments, the second semantic tag obtaining unit is configured to input a noise-reduction vocabulary into the convolutional neural subnetwork for feature extraction, so as to obtain vocabulary features;
vectorizing the vocabulary characteristics to obtain vocabulary characteristic vectors;
labeling and classifying the vocabulary feature vectors to obtain class labels of the vocabulary feature vectors;
inputting the vocabulary feature vectors into a semantic library for feature vector matching according to the category labels to obtain matching results;
and acquiring the second semantic label according to the matching result.
In some embodiments, the step of obtaining the second semantic label according to the matching result comprises:
presetting a semantic library, wherein the semantic library comprises: a reference feature vector and a reference semantic tag, the reference feature vector being associated with the reference semantic tag;
classifying the reference feature vectors to obtain one or more reference feature vector sets, wherein the reference feature vector sets are different in category;
matching the category label with the reference feature vector set to obtain a corresponding reference feature vector set;
matching the vocabulary characteristic vectors with the reference characteristic vectors in the corresponding reference characteristic vector set to obtain a first matching degree;
and when the first matching degree is greater than or equal to a preset first matching degree threshold value, determining a reference semantic label associated with the reference feature vector as the second semantic label.
In some embodiments, the training unit is configured to train the semantic recognition network according to the first semantic tag and the second semantic tag, and the training step includes:
performing similarity matching on the first semantic label and the second semantic label to obtain a second matching degree;
when the second matching degree is larger than or equal to a preset second matching degree threshold value, determining the first semantic label or the second semantic label as a semantic recognition result;
when the second matching degree is smaller than the second matching degree threshold value, determining the semantic label with higher priority as a semantic recognition result according to a preset priority rule;
and training the semantic recognition network according to the semantic recognition result to obtain a semantic recognition model.
In some embodiments, further comprising: the associated information recommendation module is used for matching the semantic recognition result with text data in an associated information base to determine corresponding text data and corresponding recommendation information;
and recommending the associated information according to the recommendation information. The associated information base includes: text data and recommendation information, the recommendation information comprising: text information and voice information, the text data being associated with the recommendation information.
The present embodiment also provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements any of the methods in the present embodiments.
The present embodiment further provides an electronic terminal, including: a processor and a memory;
the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the terminal to execute the method in the embodiment.
The computer-readable storage medium in the present embodiment can be understood by those skilled in the art as follows: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The electronic terminal provided by the embodiment comprises a processor, a memory, a transceiver and a communication interface; the memory and the communication interface are connected with the processor and the transceiver to complete mutual communication, the memory is used for storing a computer program, the communication interface is used for communication, and the processor and the transceiver are used for running the computer program so that the electronic terminal executes the steps of the above method.
In this embodiment, the Memory may include a Random Access Memory (RAM), and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The foregoing embodiments merely illustrate the principles and utilities of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.

Claims (10)

1. A method for speech semantic recognition, comprising:
acquiring a training sample set, the training sample set comprising: a speech sample set and annotation information, the speech sample set comprising: mandarin samples and dialect samples with the same voice content;
inputting the training sample set into a semantic recognition network, the semantic recognition network comprising: a speech recognition subnetwork, a long-short term memory subnetwork for obtaining a first semantic label, and a convolutional neural subnetwork for obtaining a second semantic label;
training a semantic recognition network according to the first semantic label and the second semantic label to obtain a semantic recognition model;
and inputting the voice to be recognized into the semantic recognition model for semantic recognition to finish voice semantic recognition.
2. The speech semantic recognition method according to claim 1, wherein the mandarin samples with the same speech content have a first association relationship with the dialect samples;
inputting the training sample set into a voice recognition sub-network in the semantic recognition network for voice feature extraction to obtain voice features;
classifying and labeling the voice features, and determining voice feature categories, wherein the voice feature categories comprise: mandarin and dialect;
determining a second incidence relation between the voice features of different classes according to the first incidence relation and the voice feature classes;
and acquiring a voice text by using the second association relation to complete voice recognition.
3. The speech semantic recognition method of claim 1,
obtaining attribute information in the Mandarin sample or the dialect sample, wherein the attribute information at least comprises one of the following items: region information and identity information;
determining one or more association types of the Mandarin sample or the dialect sample according to the regional information and/or the identity information;
and inputting the voice features of the Mandarin sample or the dialect sample into a voice feature library of a corresponding type according to the association type, and performing feature matching to obtain a voice text.
4. The speech semantic recognition method according to claim 3, wherein the step of inputting the speech features of the Mandarin sample or the dialect sample into the speech feature library of the corresponding type according to the association type comprises:
acquiring weights corresponding to the plurality of association types according to a preset weight distribution rule;
acquiring a matching sequence of the voice features and the voice feature libraries of different types according to weights corresponding to the plurality of association types;
and according to the matching sequence, sequentially inputting the voice features into a corresponding voice feature library for feature matching to obtain a voice text.
5. The speech semantic recognition method of claim 1,
performing word segmentation processing on the voice text output by the voice recognition sub-network to obtain one or more word segmentation vocabularies;
acquiring the term frequency and the inverse document frequency of the word segmentation vocabulary;
determining vocabulary scores of the word segmentation vocabularies according to the term frequency and the inverse document frequency;
performing truncation filtering on the word segmentation vocabularies according to the vocabulary scores and a preset score threshold to obtain the noise-reduction vocabulary;
inputting the noise reduction vocabulary into the long-short term memory sub-network, and extracting semantic features according to context information to obtain semantic feature vectors;
and acquiring a first semantic label of the voice text according to the semantic feature vector.
6. The speech semantic recognition method of claim 1,
inputting the noise-reduction vocabulary into the convolutional neural subnetwork for feature extraction to obtain vocabulary features;
vectorizing the vocabulary characteristics to obtain vocabulary characteristic vectors;
labeling and classifying the vocabulary feature vectors to obtain class labels of the vocabulary feature vectors;
inputting the vocabulary feature vectors into a semantic library for feature vector matching according to the category labels to obtain matching results;
and acquiring the second semantic label according to the matching result.
7. The speech semantic recognition method according to claim 6, wherein the step of obtaining the second semantic tag according to the matching result comprises:
presetting a semantic library, wherein the semantic library comprises: a reference feature vector and a reference semantic tag, the reference feature vector being associated with the reference semantic tag;
classifying the reference feature vectors to obtain one or more reference feature vector sets, wherein the reference feature vector sets are different in category;
matching the category label with the reference feature vector set to obtain a corresponding reference feature vector set;
matching the vocabulary characteristic vectors with the reference characteristic vectors in the corresponding reference characteristic vector set to obtain a first matching degree;
and when the first matching degree is greater than or equal to a preset first matching degree threshold value, determining a reference semantic label associated with the reference feature vector as the second semantic label.
8. The speech semantic recognition method according to claim 1, wherein the step of training a semantic recognition network according to the first semantic tag and the second semantic tag comprises:
performing similarity matching on the first semantic label and the second semantic label to obtain a second matching degree;
when the second matching degree is larger than or equal to a preset second matching degree threshold value, determining the first semantic label or the second semantic label as a semantic recognition result;
when the second matching degree is smaller than the second matching degree threshold value, determining the semantic label with higher priority as a semantic recognition result according to a preset priority rule;
and training the semantic recognition network according to the semantic recognition result to obtain a semantic recognition model.
9. The speech semantic recognition method of claim 1,
setting a correlation information base, wherein the correlation information base comprises: text data and recommendation information, the recommendation information comprising: text information and voice information, the text data being associated with the recommendation information;
matching the semantic recognition result with the text data, and determining corresponding text data and corresponding recommendation information;
and recommending the associated information according to the recommendation information.
10. A speech semantic recognition system, comprising:
an acquisition module for acquiring a training sample set, the training sample set comprising: a speech sample set and annotation information, the speech sample set comprising: mandarin samples and dialect samples with the same voice content;
a model acquisition module configured to input the training sample set into a semantic recognition network, where the semantic recognition network includes: a speech recognition subnetwork, a long-short term memory subnetwork for obtaining a first semantic label, and a convolutional neural subnetwork for obtaining a second semantic label; training a semantic recognition network according to the first semantic label and the second semantic label to obtain a semantic recognition model;
the semantic recognition module is used for inputting the voice to be recognized into the semantic recognition model for semantic recognition to complete voice semantic recognition; the acquisition module, the model acquisition module and the semantic recognition module are connected.
CN202110621932.8A 2021-06-04 2021-06-04 Voice semantic recognition method and system Withdrawn CN113299277A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110621932.8A CN113299277A (en) 2021-06-04 2021-06-04 Voice semantic recognition method and system


Publications (1)

Publication Number Publication Date
CN113299277A true CN113299277A (en) 2021-08-24

Family

ID=77327140

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110621932.8A Withdrawn CN113299277A (en) 2021-06-04 2021-06-04 Voice semantic recognition method and system

Country Status (1)

Country Link
CN (1) CN113299277A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436615A (en) * 2021-07-06 2021-09-24 南京硅语智能科技有限公司 Semantic recognition model, training method thereof and semantic recognition method
CN114171013A (en) * 2021-12-31 2022-03-11 西安讯飞超脑信息科技有限公司 Voice recognition method, device, equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210824