CN110880324A - Voice data processing method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN110880324A
Authority
CN
China
Prior art keywords
scene
content
conversation
voice data
dialog
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911053988.7A
Other languages
Chinese (zh)
Inventor
舒景辰
张岱
史彩庆
谭星
胡凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dami Technology Co Ltd
Original Assignee
Beijing Dami Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dami Technology Co Ltd
Priority to CN201911053988.7A
Publication of CN110880324A
Legal status: Pending

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
            • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
              • G06F 16/33 Querying
                • G06F 16/332 Query formulation
                  • G06F 16/3329 Natural language query formulation or dialogue systems
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
          • G10L 15/00 Speech recognition
            • G10L 15/08 Speech classification or search
              • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
              • G10L 2015/088 Word spotting
            • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
              • G10L 2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
            • G10L 15/26 Speech to text systems
            • G10L 15/28 Constructional details of speech recognition systems
              • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Abstract

The embodiments of the application disclose a voice data processing method and apparatus based on a dialogue scene, a storage medium and an electronic device, belonging to the field of online education. The method comprises the following steps: setting a dialogue scene; collecting voice data input by a user in the dialogue scene, and parsing the dialogue content of the voice data; and displaying first prompt information when the dialogue content does not match the dialogue scene, wherein the first prompt information indicates that the dialogue content of the voice data does not match the dialogue scene. The application enables unsupervised autonomous learning, reduces labor cost and improves learning efficiency.

Description

Voice data processing method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of online education, and in particular, to a method and an apparatus for processing voice data based on a dialog scenario, a storage medium, and an electronic device.
Background
With the development of the Internet, online education has become popular with more and more people: it is not restricted to a fixed time, its locations are flexible, and it allows learners to improve their skills effectively. Compared with the traditional fixed classroom, the mobile classroom is more mobile and convenient, and the visual classroom offers richer visualization with more attractive pictures and audio.
In the related art, users learn a language through a dialogue-scene-based method: the user carries out multiple rounds of dialogue with a teacher in a preset dialogue scene, and the teacher corrects the student's erroneous utterances during the dialogue. This learning mode therefore requires real-time supervision by the teacher and consumes a large amount of labor cost.
Disclosure of Invention
The voice data processing method and apparatus, storage medium and electronic device based on a dialogue scene provided by the embodiments of the application can solve the problem that manually correcting a user's dialogue content is inefficient, and achieve unsupervised autonomous learning. The technical solution is as follows:
in a first aspect, an embodiment of the present application provides a method for processing voice data based on a dialog scenario, where the method includes:
setting a conversation scene;
collecting voice data input by a user in a conversation scene, and analyzing conversation content of the voice data;
under the condition that the conversation content is not matched with the conversation scene, displaying first prompt information; wherein the first prompt information indicates that the dialogue content of the voice data does not match the dialogue scene.
In a second aspect, an embodiment of the present application provides a speech data processing apparatus based on a dialog scenario, where the speech data processing apparatus based on a dialog scenario includes:
a setting unit for setting a dialog scene;
a collecting unit for collecting voice data input by a user in a dialogue scene and parsing the dialogue content of the voice data;
a prompting unit for displaying first prompt information when the dialogue content does not match the dialogue scene; wherein the first prompt information indicates that the dialogue content of the voice data does not match the dialogue scene.
In a third aspect, embodiments of the present application provide a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the above-mentioned method steps.
In a fourth aspect, an embodiment of the present application provides an electronic device, which may include: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the above-mentioned method steps.
The beneficial effects brought by the technical scheme provided by some embodiments of the application at least comprise:
the method comprises the steps of collecting voice data input by a user under a preset conversation scene, analyzing conversation content of the voice data, judging whether the conversation content is matched with the conversation scene or not, and displaying unmatched prompt information under the unmatched condition, so that the user is prompted to send out correct conversation content, autonomous learning without manual supervision is achieved, and learning efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a diagram of a network architecture provided by an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for processing speech data based on a dialog scenario according to an embodiment of the present application;
FIG. 3 is another schematic flow chart of a speech data processing method based on dialog scenarios according to an embodiment of the present application;
FIG. 4 is another schematic flow chart of a speech data processing method based on dialog scenarios provided in an embodiment of the present application;
FIG. 5 is another schematic flow chart of a speech data processing method based on dialog scenarios according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an apparatus according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Fig. 1 shows an exemplary system architecture 100 to which the dialogue-scene-based speech data processing method or the dialogue-scene-based speech data processing apparatus of the present application can be applied.
As shown in fig. 1, the system architecture 100 may include a first terminal device 100, a first network 101, a server 102, a second network 103, and a second terminal device 104. The first network 101 is used to provide a medium for a communication link between the first terminal device 100 and the server 102, and the second network 103 is used to provide a medium for a communication link between the second terminal device 104 and the server 102. The first network 101 and the second network 103 may include various types of wired or wireless communication links, for example: a wired communication link includes an optical fiber, a twisted pair wire, or a coaxial cable, and a wireless communication link includes a Bluetooth communication link, a Wireless-Fidelity (Wi-Fi) communication link, or a microwave communication link, etc.
The first terminal device 100 communicates with the second terminal device 104 through the first network 101, the server 102, and the second network 103: the first terminal device 100 sends a message to the server 102, and the server 102 forwards the message to the second terminal device 104; the second terminal device 104 sends a message to the server 102, and the server 102 forwards the message to the first terminal device 100. Communication between the first terminal device 100 and the second terminal device 104 is thereby realized, and the types of messages exchanged between the first terminal device 100 and the second terminal device 104 include control data and service data.
In the present application, the first terminal device 100 is a terminal used by students to attend class and the second terminal device 104 is a terminal used by teachers to give class; or the first terminal device 100 is the teacher's terminal and the second terminal device 104 is the students' terminal. For example: the service data is a video stream; the first terminal device 100 acquires a first video stream of the student in class through its camera, and the second terminal device 104 acquires a second video stream of the teacher in class through its camera. The first terminal device 100 sends the first video stream to the server 102, the server 102 forwards the first video stream to the second terminal device 104, and the second terminal device 104 displays the first video stream and the second video stream on its interface; the second terminal device 104 sends the second video stream to the server 102, the server 102 forwards the second video stream to the first terminal device 100, and the first terminal device 100 displays the first video stream and the second video stream.
The class mode of the application can be one-to-one or one-to-many, namely one teacher corresponds to one student or one teacher corresponds to a plurality of students. Correspondingly, in the one-to-one teaching mode, a terminal used for a teacher to attend a class and a terminal used for a student to attend the class are communicated; in the one-to-many teaching method, one terminal for a teacher to attend a class and a plurality of terminals for students to attend a class are communicated with each other.
Various communication client applications may be installed on the first terminal device 100 and the second terminal device 104, for example: video recording application, video playing application, voice interaction application, search application, instant messaging tool, mailbox client, social platform software, etc.
The first terminal device 100 and the second terminal device 104 may be hardware or software. When they are hardware, they may be various electronic devices with a display screen, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the first terminal device 100 and the second terminal device 104 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple software programs or software modules (for example, to provide distributed services) or as a single software program or software module, which is not particularly limited herein.
When the first terminal device 100 and the second terminal device 104 are hardware, a display device and a camera may further be installed on them; the display device may be any of various devices capable of implementing a display function, and the camera is used to collect a video stream. For example: the display device may be a cathode ray tube (CRT) display, a light-emitting diode (LED) display, an electronic ink screen, a liquid crystal display (LCD), a plasma display panel (PDP), or the like. The user can view displayed information such as text, pictures and videos using the display devices on the first terminal device 100 and the second terminal device 104.
It should be noted that the voice data processing method based on the dialog scenario provided in the embodiment of the present application is generally executed by the server 102, and accordingly, the voice data processing apparatus based on the dialog scenario is generally disposed in the server 102 or the terminal device.
The server 102 may be a server that provides various services, and the server 102 may be hardware or software. When the server 102 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When the server 102 is software, it may be implemented as a plurality of software programs or software modules (for example, to provide distributed services), or as a single software program or software module, which is not particularly limited herein.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. Any number of terminal devices, networks, and servers may be used, as desired for implementation.
The following describes in detail the dialogue-scene-based speech data processing method provided in the embodiments of the present application with reference to fig. 2 to fig. 6. The dialogue-scene-based speech data processing apparatus in the embodiments of the present application may be the electronic device that performs the methods shown in fig. 2 to fig. 5.
Referring to fig. 2, a flow chart of a speech data processing method based on a dialog scenario is provided in an embodiment of the present application. As shown in fig. 2, the method of the embodiment of the present application may include the steps of:
s201, setting a conversation scene.
The dialog scenario represents an external environment where the interlocutor is located, and in the embodiment of the present application, the dialog scenario is an external environment that is set virtually, for example: the dialogue scene comprises a shopping scene, a boarding scene, a road asking scene and the like. The electronic equipment can set a conversation scene according to teaching requirements and can also set a user-defined conversation scene according to the selection of a user.
In one or more embodiments, the electronic device receives a dialog scene selection instruction of a user, sets a dialog scene based on the dialog scene selection instruction, and then displays scene information of the dialog scene, wherein the scene information can be represented in the form of pictures or videos. For example: when the user selects a shopping scene, the electronic device displays a picture or video of the shopping scene on the display screen.
S202, voice data input by a user in a conversation scene are collected, and conversation content of the voice data is analyzed.
The electronic equipment collects voice data input by a user in a dialogue scene through the audio collection device, the audio collection device converts voice sent by the user into a voice signal in an analog form, and then the voice signal in the analog form is preprocessed and converted into voice data in a digital form. The audio acquisition device can be a single microphone or a microphone array consisting of a plurality of microphones. The preprocessing process comprises the processes of filtering, amplifying, sampling, format conversion and the like. The dialog content of the speech data may be represented in the form of text, for example: the electronic device converts speech data into dialog contents in text form based on Hidden Markov Models (HMMs).
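By way of illustration only, the following minimal Python sketch shows how a terminal might capture an utterance and convert it into dialogue content in text form. It relies on the third-party SpeechRecognition package and an off-the-shelf cloud recognizer instead of the HMM decoder described above; the package, function names and availability of a microphone are assumptions for the example, not part of the disclosed embodiment.

```python
# Illustrative sketch only: assumes the third-party "SpeechRecognition" package
# (and PyAudio for microphone access) are installed; the cloud recognizer below
# stands in for the HMM-based decoder described in the embodiment.
import speech_recognition as sr

def capture_dialog_content() -> str:
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:          # audio collection device (single microphone)
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)    # user's speech captured as digital audio data
    # Convert the digital audio into dialogue content in text form.
    return recognizer.recognize_google(audio)
```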
When the electronic equipment is terminal equipment, the electronic equipment directly collects voice data input by a user in a conversation scene through an audio collection device; when the electronic equipment is a server, the server receives voice data in the form of streaming media collected by terminal equipment of a user.
And S203, displaying the first prompt message when the conversation content and the conversation scene are not matched.
Wherein the first prompt information indicates that the dialogue content of the voice data does not match the dialogue scene. A mismatch between the dialogue content and the dialogue scene means that the correlation between the dialogue content and the dialogue scene is low. For example: the preset dialogue scene is visiting a zoo, and the dialogue content of the voice data input by the user in the dialogue scene is "What did you eat for breakfast?"; this dialogue content is not highly relevant to the scene of visiting a zoo. In the embodiment of the present application, whether the dialogue content of the voice data matches the dialogue scene may be measured in a quantitative manner. The language of the dialogue content may be Chinese, English or another language, which is not limited in the embodiments of the present application.
When the electronic equipment is terminal equipment, the terminal equipment displays prompt information on a display screen; and when the electronic equipment is a server, the server generates prompt information and pushes the prompt to the terminal equipment of the user for displaying.
In one or more embodiments, a method of determining whether a dialog content of voice data and a dialog scene match includes:
extracting a first keyword set from the dialogue content of the voice data; acquiring a second keyword set associated with the dialogue scene; when the number of common keywords in the first keyword set and the second keyword set is greater than a preset number, determining that the dialogue content of the voice data matches the dialogue scene; or when the number of common keywords in the first keyword set and the second keyword set is less than or equal to the preset number, determining that the dialogue content of the voice data does not match the dialogue scene.
The electronic equipment is pre-stored or pre-configured with a mapping relation between a conversation scene and a keyword set, and different conversation scenes correspond to different keyword sets. The electronic equipment extracts keywords in the voice data by using a keyword extraction algorithm to form a first keyword set, and acquires a second keyword set associated with the current conversation scene. The common keywords represent keywords present in both the first set of keywords and the second set of keywords.
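By way of illustration, a minimal Python sketch of this keyword-overlap check is given below; the scene keyword sets, function name and preset number are assumptions for the example and are not part of the disclosed embodiment.

```python
# Illustrative mapping between dialogue scenes and their associated (second) keyword sets.
SCENE_KEYWORDS = {
    "shopping": {"price", "buy", "discount", "cashier", "pay"},
    "zoo": {"elephant", "tiger", "weight", "feed", "keeper"},
}

def matches_scene_by_keywords(first_keywords: set[str], scene: str, preset_number: int = 2) -> bool:
    """Match when the number of keywords common to both sets exceeds the preset number."""
    common = first_keywords & SCENE_KEYWORDS.get(scene, set())
    return len(common) > preset_number
```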
In one or more embodiments, a method of determining whether a dialog content of voice data and a dialog scene match includes:
acquiring reference dialogue content associated with the dialogue scene, and calculating the similarity between the dialogue content of the voice data and the reference dialogue content; if the similarity is larger than a preset threshold value, determining that the conversation content of the voice data is matched with the conversation scene; and if the similarity is smaller than or equal to the preset threshold, determining that the conversation content of the voice data is not matched with the conversation scene.
The electronic device prestores or is preconfigured with a mapping relation between dialogue scenes and reference dialogue contents, and different dialogue scenes are associated with different reference dialogue contents. The number of reference dialogue contents associated with a dialogue scene may be one or more. The similarity may be calculated based on Euclidean distance, cosine distance, Pearson correlation, or other algorithms. When the current dialogue scene is associated with a plurality of reference dialogue contents and the similarity between the dialogue content of the voice data and any one of the reference dialogue contents is greater than the preset threshold, it is determined that the dialogue content of the voice data matches the dialogue scene.
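A minimal sketch of this variant follows; a generic string-similarity measure from Python's standard difflib module stands in for the similarity algorithms listed above, and the threshold value is an illustrative assumption.

```python
import difflib

def matches_scene_by_reference(dialog_text: str, reference_contents: list[str],
                               threshold: float = 0.6) -> bool:
    """Match when the dialogue content is similar enough to any reference dialogue content."""
    return any(
        difflib.SequenceMatcher(None, dialog_text, reference).ratio() > threshold
        for reference in reference_contents
    )
```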
In one or more embodiments, a method of determining whether a dialog content of voice data and a dialog scene match includes:
acquiring a content matching degree evaluation model associated with a conversation scene; evaluating the dialogue content of the voice data based on a content matching degree evaluation model to obtain a score, and determining that the dialogue content and the dialogue scene of the voice data are not matched under the condition that the score is smaller than a preset score; and determining that the dialogue content and the dialogue scene of the voice data are matched in the case that the score is greater than or equal to the preset score.
The electronic equipment is pre-stored or pre-configured with a mapping relation between a conversation scene and a content matching degree evaluation model, and different conversation scenes are associated with different content matching degree evaluation models. The content matching degree evaluation model is trained by using voice data in a dialogue scene, and the content matching degree evaluation model is a machine learning model.
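The following sketch illustrates one possible realisation of such a scene-specific content matching degree evaluation model, under the assumption that it is an ordinary text classifier built with scikit-learn (TF-IDF features plus logistic regression) and that both in-scene and out-of-scene training utterances are available; the utterances, labels and preset score below are illustrative, not the model actually disclosed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical in-scene (1) and out-of-scene (0) utterances for a zoo dialogue scene.
texts = [
    "how heavy is the elephant",
    "where can we feed the giraffe",
    "can we see the tigers now",
    "what did you eat for breakfast",
    "how much is this shirt",
]
labels = [1, 1, 1, 0, 0]

scene_model = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

def matches_scene_by_model(dialog_text: str, preset_score: float = 0.5) -> bool:
    # Score = model's probability that the utterance belongs to the scene.
    score = scene_model.predict_proba([dialog_text])[0][1]
    return score >= preset_score
```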
When the scheme of the embodiment of the application is executed, voice data input by a user in a preset conversation scene is collected, the conversation content of the voice data is analyzed, whether the conversation content is matched with the conversation scene is judged, and unmatched prompt information is displayed under the unmatched condition, so that the user is prompted to send out correct conversation content, the autonomous learning without manual supervision is realized, and the learning efficiency is improved.
Referring to fig. 3, a flowchart of a speech data processing method based on a dialog scenario is provided in an embodiment of the present application. The present embodiment is exemplified by applying the speech data processing method based on the dialog scenario to the electronic device, and the electronic device may be a server or a terminal device. The voice data processing method based on the dialogue scene comprises the following steps:
s301, setting a conversation scene.
The dialog scenario represents an external environment where the interlocutor is located, and the dialog scenario in the embodiment of the present application is an external environment that is set virtually. For example: the conversation scene comprises a shopping scene, a boarding scene, a zoo scene and the like. The electronic equipment can set a conversation scene according to teaching requirements and can also set the conversation scene according to the selection of a user.
In one or more embodiments, the electronic device receives a dialog scene selection instruction of a user, sets a dialog scene based on the dialog scene selection instruction, and then displays scene information of the dialog scene, wherein the scene information can be represented in the form of pictures or videos.
In one or more embodiments, the electronic device is provided with a touch display screen, the electronic device displays a plurality of conversation scenes, the user selects one of the conversation scenes based on touch operation, the electronic device acquires a picture associated with the selected conversation scene based on the conversation scene selected by the user, and the picture is displayed.
And S302, collecting voice data input by a user in a conversation scene.
The electronic device pre-stores or is pre-configured with duration of a dialog scene, the duration can be represented by a start time and an end time, and the electronic device collects voice data input by a user in the dialog scene within the duration. The electronic equipment collects voice data input by a user in a conversation scene through the audio collection device, the audio collection device converts voice sent by the user into voice data in an analog form, and then the voice data in the analog form is preprocessed to obtain voice data in a digital form. The audio acquisition device can be a single microphone or a microphone array consisting of a plurality of microphones. The preprocessing process comprises the processes of filtering, amplifying, sampling, format conversion and the like.
S303, analyzes the dialogue content of the voice data.
The dialogue content of the voice data can be represented in a text form, the electronic device can convert the voice data into the dialogue content in the text form based on the HMM, and the dialogue content comprises a plurality of keywords.
S304, extracting a first keyword set in the dialogue content of the voice data.
The electronic device extracts keywords from the text of the dialogue content to obtain a first keyword set, where the first keyword set includes one or more keywords. The electronic device may extract the first keyword set from the text of the dialogue content using a keyword extraction algorithm such as term frequency-inverse document frequency (TF-IDF), TextRank, RAKE, Topic-Model, and so on.
For example: the text of the dialog content is "what is the weight of the elephant? ", the keywords extracted by the electronic device based on the TF-IDF are" elephant "and" weight ".
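A minimal sketch of TF-IDF keyword extraction using scikit-learn is given below; the reference corpus, stop-word filtering and number of returned keywords are illustrative assumptions for an English example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_keywords(dialog_text: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k words of dialog_text with the highest TF-IDF weight relative to the corpus."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(corpus + [dialog_text])
    scores = matrix.toarray()[-1]                  # TF-IDF row for dialog_text
    vocabulary = vectorizer.get_feature_names_out()
    ranked = scores.argsort()[::-1]
    return [vocabulary[i] for i in ranked[:k] if scores[i] > 0]

# Example call (corpus contents are hypothetical):
# extract_keywords("what is the weight of the elephant",
#                  ["the zoo opens at nine", "tickets are sold at the gate"])
```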
S305, acquiring a second keyword set associated with the conversation scene.
The electronic device prestores or is preconfigured with a second keyword set associated with the conversation scene, different conversation scenes have different second keyword sets, and the second keyword set comprises a plurality of keywords.
S306, counting the number of the common relevant key words in the first key word set and the second key word set.
The common keywords are keywords existing in both the first keyword set and the second keyword set. For example: the first keyword set comprises keyword A, keyword B and keyword C; by comparing the first keyword set with the second keyword set, keyword A and keyword B are determined to be common keywords, so the number of common keywords is 2.
And S307, judging whether the number is larger than the preset number.
The electronic device prestores or is preconfigured with a preset number, and the preset number may be determined according to actual needs, which is not limited in the embodiments of the present application. When the electronic device determines that the number of common keywords in the first keyword set and the second keyword set is greater than the preset number, the dialogue content of the voice data input by the user in the dialogue scene matches the dialogue scene, and S309 is executed; if the number of common keywords in the first keyword set and the second keyword set is less than or equal to the preset number, the dialogue content of the voice data input by the user in the dialogue scene does not match the dialogue scene, and S308 is executed.
And S308, displaying the first prompt message.
Wherein the first prompt information is used for indicating that the dialogue content of the voice data input by the user in the dialogue scene and the dialogue scene are not matched. Furthermore, the electronic equipment can also display a second keyword set associated with the conversation scene, so that the user generates correct conversation content according to the prompt of the second keyword set.
For example: the dialogue scene is a shopping scene, and the electronic device displays a background picture of a supermarket. The dialogue content of the voice data uttered by the user in the dialogue scene is "How old is that girl?". The electronic device extracts the first keyword set from the dialogue content, acquires the second keyword set associated with the shopping scene, and compares the first keyword set with the second keyword set. If the comparison result shows that the number of common keywords in the first keyword set and the second keyword set is less than or equal to the preset number, the first prompt information displayed by the electronic device is a red "X" pattern, and at the same time the electronic device displays the keywords in the second keyword set.
S309, displaying the second prompt message.
Wherein the second prompt information indicates that the dialogue content of the voice data input by the user in the dialogue scene and the dialogue scene are matched.
For example: the dialogue scene is a zoo scene, and the electronic device displays a background picture of the zoo. The dialogue content of the user's voice data in the dialogue scene is "What is the weight of the elephant?". When the electronic device determines that the dialogue content matches the zoo scene, the displayed second prompt information is a green thumbs-up pattern.
According to the embodiment of the application, voice data input by a user in a preset dialogue scene is collected, the dialogue content of the voice data is parsed, whether the dialogue content matches the dialogue scene is judged according to the number of keywords common to the dialogue content and the keyword set associated with the dialogue scene, and prompt information indicating a mismatch is displayed in the case of a mismatch, so that the user is prompted to produce correct dialogue content, autonomous learning without manual supervision is achieved, and the learning efficiency is improved.
Referring to fig. 4, a flowchart of a speech data processing method based on a dialog scenario provided in the embodiment of the present application is schematically shown. As shown in fig. 4, the method of the embodiment of the present application may include the steps of:
s401, setting a conversation scene in response to the input conversation scene selection instruction.
The electronic device is pre-stored or pre-configured with a plurality of conversation scenes, and the conversation scene selection instruction is used for selecting one conversation scene from the plurality of conversation scenes. For example: the conversation scene comprises a shopping scene, a boarding scene, a zoo scene and the like. The dialog scenario selection instruction is triggered based on user actions of types including, but not limited to: touch control operation, mouse operation, key operation, voice control operation, body sensing operation and the like.
S402, obtaining scene information associated with the conversation scene, and displaying the scene information.
The electronic device is pre-stored or pre-configured with scene information, and the scene information may be one or more of pictures, texts and videos for describing a conversation scene. Different dialog scenes are associated with different scene information. The electronic device displays the scene information as a background.
For example: the electronic device prestores or is preconfigured with the following relationship between dialogue scenes and scene information: picture 1 is associated with dialogue scene 1, picture 2 is associated with dialogue scene 2, and picture 3 is associated with dialogue scene 3.
And S403, collecting voice data input by a user in a conversation scene.
The electronic equipment can use an audio acquisition device to acquire voice data input by a user in the conversation scene within the duration, convert voice emitted by the user into voice data in an analog form, and then preprocess the voice data in the analog form to obtain voice data in a digital form. The audio acquisition device can be a single microphone or a microphone array consisting of a plurality of microphones. The preprocessing process comprises the processes of filtering, amplifying, sampling, format conversion and the like.
S404, analyzing the dialogue content of the voice data.
The electronic device can convert the voice data into the dialogue content in the text form based on the HMM, and the dialogue content comprises a plurality of keywords.
For example: the text of the dialog content is "how fast is the cheetah? ", the keywords extracted by the electronic device based on the TF-IDF are" cheetah "and" speed ".
S405, obtaining reference conversation content related to the conversation scene.
The electronic device is pre-stored or pre-configured with reference dialog contents, the reference dialog contents include a plurality of dialog contents, different dialog scenes are associated with different reference dialog contents, and the reference dialog contents can also be represented in a text form.
For example: the corresponding relationship between the pre-stored or pre-configured dialog scene and the reference dialog content of the electronic device is as follows: the dialogue scene 1 is associated with the reference dialogue content 1, the dialogue scene 2 is associated with the reference dialogue content 2, and the dialogue scene 3 is associated with the reference dialogue content 3.
And S406, calculating the similarity between the conversation content of the voice data and the reference conversation content.
The dialogue content of the voice data and the reference dialogue content are expressed in a text form, the reference dialogue content comprises a plurality of dialogue contents, and the similarity is calculated between the dialogue contents of the voice data and each dialogue content in the reference dialogue contents.
In one or more embodiments, the electronic device may use cosine similarity to calculate the similarity between the dialogue content of the voice data and the reference dialogue content. First, the electronic device performs word segmentation on the dialogue content of the voice data and counts the number of occurrences of each keyword in the dialogue content; it also performs word segmentation on the reference dialogue content and counts the number of occurrences of each keyword in the reference dialogue content. Then, a cosine value is calculated based on the cosine formula from the occurrence counts of the keywords in the dialogue content and in the reference dialogue content: the closer the cosine value is to 1, the more similar the dialogue content of the voice data is to the reference dialogue content; the closer the cosine value is to 0, the less similar they are.
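A sketch of this word-count cosine computation follows, with whitespace tokenisation assumed in place of the word segmentation step:

```python
from collections import Counter
import math

def cosine_similarity(dialog_text: str, reference_text: str) -> float:
    """Cosine of the angle between the keyword-occurrence vectors of the two texts."""
    counts_a = Counter(dialog_text.split())
    counts_b = Counter(reference_text.split())
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```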
In one or more embodiments, the electronic device may evaluate the similarity between the dialogue content of the voice data and the reference dialogue content using a simple common-word method. The electronic device identifies the common keywords (common words) between the dialogue content of the voice data and the reference dialogue content, determines the number of common words, and divides that number by the length of the longer of the two dialogue contents to obtain the similarity.
For example: when the number of common words between dialogue content A and dialogue content B is 4, and the longer of dialogue content A and dialogue content B has a length of 6, the similarity between dialogue content A and dialogue content B is 4/6 ≈ 0.667.
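A sketch of the common-word computation matching the 4/6 example above, again assuming whitespace tokenisation:

```python
def common_word_similarity(content_a: str, content_b: str) -> float:
    """Number of distinct common words divided by the length of the longer content."""
    words_a, words_b = content_a.split(), content_b.split()
    common = set(words_a) & set(words_b)
    return len(common) / max(len(words_a), len(words_b))
```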
In one or more embodiments, the electronic device determines the similarity between the dialogue content of the voice data and the reference dialogue content based on the edit distance. The edit distance between two character strings is the minimum number of edit operations required to convert one character string into the other. The edit operations here include replacing a character, inserting a character, and deleting a character. The smaller the edit distance, the greater the similarity between the two character strings.
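A sketch of the edit-distance (Levenshtein) computation is given below; converting the distance into a normalised similarity score is one illustrative choice among several.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character replacements, insertions and deletions turning a into b."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        prev, row[0] = row[0], i
        for j, cb in enumerate(b, start=1):
            prev, row[j] = row[j], min(row[j] + 1,         # delete ca
                                       row[j - 1] + 1,     # insert cb
                                       prev + (ca != cb))  # replace (or keep) ca
    return row[-1]

def edit_similarity(a: str, b: str) -> float:
    # One possible normalisation: 1.0 when identical, approaching 0 as the distance grows.
    return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)
```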
In one or more embodiments, the electronic device can determine a similarity between the conversation content of the speech data and the reference conversation content based on the hamming distance. The electronic device converts the dialogue contents of the voice data and the reference dialogue contents into 64-bit binary numbers based on a hash algorithm, and then compares hamming distances between the two binary numbers to determine a similarity.
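A sketch of the 64-bit hashing and Hamming-distance comparison follows; SimHash over whitespace tokens is an illustrative choice of hash construction, not necessarily the one used by the embodiment.

```python
import hashlib

def simhash64(text: str) -> int:
    """Fold the 64-bit hashes of the words into a single 64-bit fingerprint (SimHash)."""
    weights = [0] * 64
    for word in text.split():
        digest = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & ((1 << 64) - 1)
        for bit in range(64):
            weights[bit] += 1 if (digest >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)

def hamming_distance(x: int, y: int) -> int:
    """Number of differing bits; the smaller the distance, the more similar the two texts."""
    return bin(x ^ y).count("1")
```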
And S407, judging whether the similarity is greater than a preset threshold value.
The electronic device prestores or is preconfigured with a preset threshold, the preset threshold can be determined according to actual requirements, the embodiment of the application is not limited, and when the electronic device determines that the similarity between the conversation content of the voice data and the reference conversation content is less than or equal to the preset threshold, the conversation content of the voice data and the conversation scene are not matched, S408 is executed; when the electronic device determines that the similarity between the dialogue content of the voice data and the reference dialogue content is greater than the preset threshold, the dialogue content of the voice data and the dialogue scene are matched, and S409 is executed.
And S408, displaying the first prompt message.
Wherein the first prompt information is used for indicating that the dialogue content of the voice data input by the user in the dialogue scene and the dialogue scene are not matched. Further, the electronic device can also display the keywords of the reference dialogue content associated with the dialogue scene, so that the user can generate the correct dialogue content according to the prompt of the keywords of the reference dialogue content.
And S409, displaying the second prompt message.
Wherein the second prompt information indicates that the dialogue content of the voice data input by the user in the dialogue scene and the dialogue scene are matched.
By implementing the embodiment of the application, the voice data input by the user in the preset dialogue scene is collected, the dialogue content of the voice data is analyzed, whether the dialogue content is matched with the dialogue scene is judged according to the similarity between the dialogue content and the reference dialogue content, and unmatched prompt information is displayed under the unmatched condition, so that the user is prompted to send out correct dialogue content, autonomous learning without manual supervision is realized, and the learning efficiency is improved.
Referring to fig. 5, a schematic flowchart of a speech data processing method based on dialog scenarios provided in an embodiment of the present application is shown, where the method in the embodiment of the present application may include the following steps:
s501, training a plurality of content matching degree evaluation models.
The electronic equipment prestores a plurality of training samples, different training samples correspond to different conversation scenes, each training sample comprises a plurality of conversation contents, and for each training sample, the electronic equipment performs machine learning based on the plurality of conversation contents in the training samples to obtain a content matching degree evaluation model. Each dialog scene corresponds to a content matching degree evaluation model. The content matching degree evaluation model is a machine learning model for evaluating whether the input dialogue content matches with the dialogue scene.
For example: the electronic device is pre-configured with 3 training samples: training sample 1, training sample 2, and training sample 3. Training sample 1 corresponds to dialog scenario 1, and training sample 1 includes a plurality of dialog contents that match dialog scenario 1. The training sample 2 corresponds to the dialogue scene 2, and the training sample 2 includes a plurality of dialogue contents matched with the dialogue scene 2. The training sample 3 corresponds to the dialogue scene 3, and the training sample 3 includes a plurality of dialogue contents matched with the dialogue scene 3.
And S502, setting a conversation scene in response to the input conversation scene selection instruction.
The electronic device is pre-stored or pre-configured with a plurality of conversation scenes, and the conversation scene selection instruction is used for selecting one conversation scene from the plurality of conversation scenes. For example: the conversation scene comprises a shopping scene, a boarding scene, a zoo scene and the like. The dialog scenario selection instruction is triggered based on user actions of types including, but not limited to: touch control operation, mouse operation, key operation, voice control operation, body sensing operation and the like.
S503, acquiring scene information associated with the conversation scene, and displaying the scene information.
The electronic device is pre-stored or pre-configured with scene information, and the scene information may be one or more of pictures, texts and videos for describing a conversation scene. Different dialog scenes are associated with different scene information, and the electronic equipment can display the scene information as a background.
And S504, voice data input by the user in the conversation scene is collected.
The electronic device prestores or is preconfigured with the duration of a dialogue scene, which may be represented by a start time and an end time. The electronic device uses an audio collection device to collect, within the duration, the voice data input by the user in the dialogue scene; the audio collection device converts the voice uttered by the user into voice data in analog form, and the analog voice data is then preprocessed to obtain voice data in digital form. The audio collection device may be a single microphone or a microphone array consisting of a plurality of microphones. The preprocessing includes filtering, amplifying, sampling, format conversion and the like.
And S505, analyzing the dialogue content of the voice data.
The electronic device can convert the voice data into the dialogue content in the text form based on the HMM, and the dialogue content comprises a plurality of keywords.
And S506, acquiring a content matching degree evaluation model associated with the conversation scene.
The electronic device pre-stores or is pre-configured with a plurality of content matching degree evaluation models, and the electronic device determines a corresponding content matching degree evaluation model from the plurality of content matching degree evaluation models based on the dialog scene selected in S502.
And S507, generating a feature vector according to the conversation content of the voice data.
The feature vector may be a text vector, and the electronic device may extract the feature vector from the dialogue content based on a neural network, for example by generating feature vectors with the word2vec model.
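A sketch of S507 using the gensim implementation of word2vec follows; the toy training sentences and the averaging of word vectors into a single dialogue vector are illustrative assumptions.

```python
import numpy as np
from gensim.models import Word2Vec  # third-party package, assumed available

# Toy corpus of tokenised in-scene sentences used to train the word vectors.
sentences = [
    ["how", "heavy", "is", "the", "elephant"],
    ["the", "zoo", "opens", "at", "nine"],
    ["where", "can", "we", "feed", "the", "giraffe"],
]
w2v = Word2Vec(sentences, vector_size=50, min_count=1)

def dialog_feature_vector(tokens: list[str]) -> np.ndarray:
    """Average the word vectors of the known tokens to obtain a fixed-length feature vector."""
    vectors = [w2v.wv[token] for token in tokens if token in w2v.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)
```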
And S508, evaluating the feature vectors based on the content matching degree evaluation model to obtain scores.
S509, judging whether the score is larger than a preset score.
The electronic device prestores or is preconfigured with a preset score, and the preset score may be determined according to actual requirements, which is not limited in the embodiments of the present application. When the score output by the content matching degree evaluation model is less than or equal to the preset score, the dialogue content of the voice data does not match the dialogue scene, and S510 is executed; when the electronic device determines that the score output by the content matching degree evaluation model is greater than the preset score, the dialogue content of the voice data matches the dialogue scene, and S511 is executed.
And S510, displaying the first prompt message.
Wherein the first prompt information is used for indicating that the dialogue content of the voice data input by the user in the dialogue scene does not match the dialogue scene. Further, the electronic device may also display a second keyword set associated with the dialogue scene set in S502, so that the user produces correct dialogue content according to the prompt of the second keyword set.
And S511, displaying the second prompt message.
Wherein the second prompt information indicates that the dialogue content of the voice data input by the user in the dialogue scene and the dialogue scene are matched.
By implementing the embodiment of the application, the voice data input by the user in the preset dialogue scene is collected, the dialogue content of the voice data is analyzed, whether the dialogue content is matched with the dialogue scene is judged through the content matching degree model, and unmatched prompt information is displayed under the unmatched condition, so that the user is prompted to send out correct dialogue content, the autonomous learning without manual supervision is realized, and the learning efficiency is improved.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 6, a schematic structural diagram of a speech data processing apparatus based on dialog scenarios according to an exemplary embodiment of the present application is shown. Hereinafter referred to as the apparatus 6, the apparatus 6 may be implemented as all or a part of the terminal by software, hardware or a combination of both. The device 6 comprises a setting unit 601, a collecting unit 602 and a prompting unit 603.
A setting unit 601, configured to set a dialog scenario.
The collecting unit 602 is configured to collect voice data input by a user in a dialog scenario, and parse dialog contents of the voice data.
A prompting unit 603, configured to display first prompt information when the dialogue content does not match the dialogue scene; wherein the first prompt information indicates that the dialogue content of the voice data does not match the dialogue scene.
In one or more embodiments, the apparatus 6 further comprises:
the matching unit is used for extracting a first keyword set in the dialogue content of the voice data;
acquiring a second keyword set associated with the conversation scene;
when the number of common key words in the first key word set and the second key word set is larger than a preset number, determining that the conversation content of the voice data is matched with the conversation scene; or
and when the number of common keywords in the first keyword set and the second keyword set is less than or equal to the preset number, determine that the dialogue content of the voice data does not match the dialogue scene.
In one or more embodiments, the first prompt further includes: the second set of keywords.
In one or more embodiments, the apparatus 6 further comprises:
a matching unit for acquiring a reference dialogue content associated with the dialogue scene;
calculating the similarity between the dialogue content of the voice data and the reference dialogue content;
if the similarity is larger than a preset threshold value, determining that the voice data is matched with the conversation scene;
and if the similarity is smaller than or equal to a preset threshold value, determining that the voice data and the conversation scene are not matched.
In one or more embodiments, the first prompt information further includes: the reference dialogue content.
In one or more embodiments, the apparatus 6 further comprises:
the matching unit is used for acquiring a content matching degree evaluation model associated with the conversation scene;
generating a feature vector according to the dialogue content of the voice data;
evaluating the feature vectors based on the content matching degree evaluation model to obtain scores;
and determining that the dialogue content of the voice data and the dialogue scene are not matched under the condition that the score is smaller than a preset score.
In one or more embodiments, the setting unit 601 is specifically configured to:
setting a dialog scene in response to an input dialog scene selection instruction;
acquiring scene information associated with the dialog scene, and displaying the scene information.
It should be noted that, when the apparatus 6 provided in the foregoing embodiment executes the voice data processing method based on the dialog scenario, only the division of the above functional modules is taken as an example, and in practical applications, the above functions may be distributed to different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions. In addition, the embodiments of the speech data processing method based on the dialog scenario provided in the above embodiments belong to the same concept, and details of the implementation process are referred to in the embodiments of the method, which are not described herein again.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
The device 6 collects voice data input by a user in a preset dialogue scene, analyzes the dialogue content of the voice data, judges whether the dialogue content is matched with the dialogue scene or not, and displays unmatched prompt information under the unmatched condition, so that the user is prompted to send correct dialogue content, the autonomous learning without manual supervision is realized, and the learning efficiency is improved.
An embodiment of the present application further provides a computer storage medium, where the computer storage medium may store a plurality of instructions, where the instructions are suitable for being loaded by a processor and executing the method steps in the embodiments shown in fig. 2 to 5, and a specific execution process may refer to specific descriptions of the embodiments shown in fig. 2 to 5, which are not described herein again.
The present application further provides a computer program product, which stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the method for processing voice data based on dialog scenarios as described in the above embodiments.
Fig. 7 is a schematic structural diagram of a speech data processing apparatus based on a dialog scenario according to an embodiment of the present application, which is hereinafter referred to as an apparatus 7 for short, where the apparatus 7 may be integrated in the aforementioned server or terminal device, as shown in fig. 7, the apparatus includes: memory 702, processor 701, input device 703, output device 704, and a communication interface.
The memory 702 may be a separate physical unit, and may be connected to the processor 701, the input device 703, and the output device 704 through a bus. The memory 702, processor 701, input device 703, and output device 704 may also be integrated, implemented in hardware, etc.
The memory 702 is used for storing a program for implementing the above method embodiment, or various modules of the apparatus embodiment, and the processor 701 calls the program to perform the operations of the above method embodiment.
Input devices 703 include, but are not limited to, a keyboard, a mouse, a touch panel, a camera, and a microphone; output devices 704 include, but are not limited to, a display screen.
Communication interfaces are used to send and receive various types of messages and include, but are not limited to, wireless interfaces or wired interfaces.
Alternatively, when part or all of the voice data processing method of the above embodiments is implemented by software, the apparatus may also include only a processor. The memory for storing the program is located outside the apparatus, and the processor is connected to the memory through circuits/wires for reading and executing the program stored in the memory.
The processor may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP.
The processor may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a General Array Logic (GAL), or any combination thereof.
The memory may include volatile memory (volatile memory), such as random-access memory (RAM); the memory may also include a non-volatile memory (non-volatile memory), such as a flash memory (flash memory), a Hard Disk Drive (HDD) or a solid-state drive (SSD); the memory may also comprise a combination of memories of the kind described above.
The processor 701 calls the program code in the memory 702 to perform the following steps:
setting a conversation scene;
collecting voice data input by a user in a conversation scene, and analyzing conversation content of the voice data;
displaying first prompt information on a display under the condition that the conversation content and the conversation scene do not match; wherein the first prompt information indicates that the dialogue content of the voice data does not match the dialogue scene.
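For illustration only, the following Python sketch shows one possible arrangement of these three steps; the scene table, the transcribe() stub, and the is_match() check are assumptions introduced here for clarity and are not taken from the embodiments.

SCENES = {
    "restaurant": {"menu", "order", "waiter", "bill", "table"},
    "airport":    {"flight", "boarding", "gate", "passport", "luggage"},
}

def transcribe(audio_bytes: bytes) -> str:
    """Stand-in for a speech-recognition step that returns the dialogue content."""
    return audio_bytes.decode("utf-8")          # placeholder: the "audio" is already text

def is_match(dialogue_content: str, scene: str, min_common: int = 1) -> bool:
    """Placeholder matcher; the later embodiments describe concrete variants."""
    words = set(dialogue_content.lower().split())
    return len(words & SCENES[scene]) > min_common

def handle_utterance(audio_bytes: bytes, scene: str) -> str:
    dialogue_content = transcribe(audio_bytes)  # collect and analyze the voice data
    if not is_match(dialogue_content, scene):   # compare with the dialogue scene
        return f"Prompt: your reply does not match the '{scene}' scene."
    return "OK"

print(handle_utterance(b"I would like to order from the menu", "restaurant"))
print(handle_utterance(b"Where is the boarding gate", "restaurant"))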
In one or more embodiments, the processor 701 is further configured to perform the following steps: extracting a first keyword set from the dialogue content of the voice data;
acquiring a second keyword set associated with the conversation scene;
when the number of common keywords in the first keyword set and the second keyword set is greater than a preset number, determining that the dialogue content of the voice data matches the conversation scene; or
when the number of common keywords in the first keyword set and the second keyword set is less than or equal to the preset number, determining that the dialogue content of the voice data does not match the conversation scene.
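A minimal sketch of this keyword-counting embodiment is given below; the stop-word list, the tokenizer, the scene keyword table, and the preset number are illustrative assumptions and are not prescribed by the application.

STOP_WORDS = {"i", "a", "an", "the", "to", "is", "would", "like", "please"}
PRESET_NUMBER = 1          # assumed threshold on the number of common keywords

def extract_keywords(dialogue_content: str) -> set:
    """First keyword set: content words taken from the recognized dialogue."""
    tokens = dialogue_content.lower().split()
    return {t.strip(".,!?") for t in tokens} - STOP_WORDS

def scene_keywords(scene: str) -> set:
    """Second keyword set: words associated with the conversation scene."""
    table = {"restaurant": {"menu", "order", "waiter", "bill", "dish"}}
    return table[scene]

def content_matches_scene(dialogue_content: str, scene: str) -> bool:
    first_set = extract_keywords(dialogue_content)
    second_set = scene_keywords(scene)
    common = first_set & second_set
    # a match only when the number of common keywords exceeds the preset number
    return len(common) > PRESET_NUMBER

print(content_matches_scene("I would like to order a dish from the menu", "restaurant"))  # True
print(content_matches_scene("My flight is delayed", "restaurant"))                        # False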
In one or more embodiments, the first prompt information further includes: the second keyword set.
In one or more embodiments, the processor 701 is further configured to:
acquiring reference conversation content associated with the conversation scene;
calculating the similarity between the dialogue content of the voice data and the reference dialogue content;
if the similarity is greater than a preset threshold, determining that the dialogue content of the voice data matches the conversation scene;
and if the similarity is less than or equal to the preset threshold, determining that the dialogue content of the voice data does not match the conversation scene.
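The similarity measure itself is not specified by the embodiments; the sketch below assumes a simple bag-of-words cosine similarity and an illustrative preset threshold.

import math
from collections import Counter

PRESET_THRESHOLD = 0.3     # assumed threshold on the similarity

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two pieces of dialogue content."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    common = set(va) & set(vb)
    dot = sum(va[w] * vb[w] for w in common)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def matches_by_reference(dialogue_content: str, reference_content: str) -> bool:
    similarity = cosine_similarity(dialogue_content, reference_content)
    return similarity > PRESET_THRESHOLD        # greater than the threshold means a match

reference = "could I see the menu and order some food please"
print(matches_by_reference("I want to order food from the menu", reference))   # True
print(matches_by_reference("my car broke down on the highway", reference))     # False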
In one or more embodiments, the first prompt information further includes: the reference conversation content.
In one or more embodiments, the processor 701 is further configured to:
acquiring a content matching degree evaluation model associated with the conversation scene;
generating a feature vector according to the dialogue content of the voice data;
evaluating the feature vector based on the content matching degree evaluation model to obtain a score;
and determining that the dialogue content of the voice data does not match the conversation scene when the score is less than a preset score.
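The form of the content matching degree evaluation model is likewise left open; the sketch below assumes, purely for illustration, a fixed vocabulary, a hand-set linear model for one scene, and a preset score of 0.5. In practice the model associated with each dialogue scene would be trained rather than hard-coded.

import math

VOCABULARY = ["menu", "order", "waiter", "bill", "dish", "flight", "gate"]
SCENE_WEIGHTS = [0.9, 0.9, 0.8, 0.7, 0.8, -0.9, -0.9]   # assumed "restaurant" model
PRESET_SCORE = 0.5

def feature_vector(dialogue_content: str) -> list:
    """Generate a feature vector from the dialogue content (word counts here)."""
    words = dialogue_content.lower().split()
    return [float(words.count(w)) for w in VOCABULARY]

def evaluate(features: list, weights: list) -> float:
    """Score the feature vector with the evaluation model, squashed to 0..1."""
    raw = sum(f * w for f, w in zip(features, weights))
    return 1.0 / (1.0 + math.exp(-raw))

features = feature_vector("I would like to order a dish and see the menu")
score = evaluate(features, SCENE_WEIGHTS)
print(score >= PRESET_SCORE)   # True: the dialogue content matches the scene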
In one or more embodiments, the setting of the dialog scene performed by the processor 701 includes:
setting a dialog scene in response to an input dialog scene selection instruction;
obtaining scene information associated with the dialog scene, and displaying the scene information on a display.
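A small sketch of this scene-selection step follows; the scene names and scene information are hypothetical placeholders used only to show the flow of setting a scene and returning its information for display.

SCENE_INFO = {
    "restaurant": "You are ordering dinner; the tutor plays the waiter.",
    "airport":    "You are checking in for a flight; the tutor plays the agent.",
}

class DialogSession:
    def __init__(self):
        self.scene = None

    def set_scene(self, selection_instruction: str) -> str:
        """Set the dialog scene from the selection instruction and return the scene information."""
        if selection_instruction not in SCENE_INFO:
            raise ValueError(f"unknown scene: {selection_instruction}")
        self.scene = selection_instruction
        return SCENE_INFO[selection_instruction]

session = DialogSession()
print(session.set_scene("restaurant"))    # scene information shown on the display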
The embodiment of the present application further provides a computer storage medium, which stores a computer program, where the computer program is used to execute the speech data processing method based on the dialog scenario provided in the foregoing embodiment.
The embodiment of the present application further provides a computer program product containing instructions, which when run on a computer, causes the computer to execute the method for processing speech data based on dialog scenarios provided in the foregoing embodiment.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (10)

1. A method for processing speech data based on dialog scenarios, the method comprising:
setting a conversation scene;
collecting voice data input by a user in the conversation scene, and analyzing the conversation content of the voice data;
under the condition that the conversation content is not matched with the conversation scene, displaying first prompt information; wherein the first prompt information indicates that the dialogue content of the voice data does not match the dialogue scene.
2. The method according to claim 1, wherein before displaying the first prompt message in the case that the dialog content does not match the dialog scene, the method further comprises:
extracting a first keyword set in the dialogue content of the voice data;
acquiring a second keyword set associated with the conversation scene;
when the number of common keywords in the first keyword set and the second keyword set is greater than a preset number, determining that the conversation content of the voice data matches the conversation scene; or
determining that the conversation content of the voice data does not match the conversation scene when the number of common keywords in the first keyword set and the second keyword set is less than or equal to the preset number.
3. The method of claim 2, wherein the first prompt information further comprises: the second keyword set.
4. The method according to claim 1, wherein before displaying the first prompt message in the case that the dialog content does not match the dialog scene, the method further comprises:
acquiring reference conversation content associated with the conversation scene;
calculating the similarity between the dialogue content of the voice data and the reference dialogue content;
if the similarity is larger than a preset threshold value, determining that the voice data is matched with the conversation scene;
and if the similarity is smaller than or equal to a preset threshold value, determining that the voice data and the conversation scene are not matched.
5. The method of claim 4, wherein the first prompt information further comprises: the reference conversation content.
6. The method according to claim 1, wherein before displaying the first prompt message in the case that the dialog content does not match the dialog scene, the method further comprises:
acquiring a content matching degree evaluation model associated with the conversation scene;
generating a feature vector according to the dialogue content of the voice data;
evaluating the feature vectors based on the content matching degree evaluation model to obtain scores;
and determining that the dialogue content of the voice data and the dialogue scene are not matched under the condition that the score is smaller than a preset score.
7. The method of any one of claims 1 to 4, wherein the setting a conversation scene comprises:
setting the conversation scene in response to an input conversation scene selection instruction;
acquiring scene information associated with the conversation scene, and displaying the scene information.
8. A speech data processing apparatus based on dialog scenarios, the apparatus comprising:
a setting unit for setting a dialog scene;
the system comprises a collecting unit, a processing unit and a processing unit, wherein the collecting unit is used for collecting voice data input by a user in a conversation scene and analyzing the conversation content of the voice data;
a prompting unit, configured to display first prompt information when the conversation content does not match the conversation scene; wherein the first prompt information indicates that the conversation content of the voice data does not match the conversation scene.
9. A computer storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor and to carry out the method steps according to any one of claims 1 to 7.
10. An electronic device, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1 to 7.
CN201911053988.7A 2019-10-31 2019-10-31 Voice data processing method and device, storage medium and electronic equipment Pending CN110880324A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911053988.7A CN110880324A (en) 2019-10-31 2019-10-31 Voice data processing method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911053988.7A CN110880324A (en) 2019-10-31 2019-10-31 Voice data processing method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN110880324A true CN110880324A (en) 2020-03-13

Family

ID=69728285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911053988.7A Pending CN110880324A (en) 2019-10-31 2019-10-31 Voice data processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110880324A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130262105A1 (en) * 2012-03-28 2013-10-03 Microsoft Corporation Dynamic long-distance dependency with conditional random fields
CN104464733A (en) * 2014-10-28 2015-03-25 百度在线网络技术(北京)有限公司 Multi-scene managing method and device of voice conversation
CN107644641A (en) * 2017-07-28 2018-01-30 深圳前海微众银行股份有限公司 Session operational scenarios recognition methods, terminal and computer-readable recording medium
CN109473100A (en) * 2018-11-12 2019-03-15 四川驹马科技有限公司 Business scenario voice man-machine interaction method and its system based on speech recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI DEYI: "Introduction to Artificial Intelligence" (《人工智能导论》), 30 September 2018 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111354334A (en) * 2020-03-17 2020-06-30 北京百度网讯科技有限公司 Voice output method, device, equipment and medium
CN111354334B (en) * 2020-03-17 2023-09-15 阿波罗智联(北京)科技有限公司 Voice output method, device, equipment and medium
CN111696536A (en) * 2020-06-05 2020-09-22 北京搜狗科技发展有限公司 Voice processing method, apparatus and medium
CN111696536B (en) * 2020-06-05 2023-10-27 北京搜狗智能科技有限公司 Voice processing method, device and medium
CN111966814A (en) * 2020-07-01 2020-11-20 广东职业技术学院 Method and system for assisting English conversation
CN113516986A (en) * 2021-07-23 2021-10-19 上海传英信息技术有限公司 Voice processing method, terminal and storage medium
CN114155479A (en) * 2022-02-09 2022-03-08 中农北极星(天津)智能农机装备有限公司 Language interaction processing method and device and electronic equipment
CN114155479B (en) * 2022-02-09 2022-04-26 中农北极星(天津)智能农机装备有限公司 Language interaction processing method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN110033659B (en) Remote teaching interaction method, server, terminal and system
CN110880324A (en) Voice data processing method and device, storage medium and electronic equipment
CN111381909B (en) Page display method and device, terminal equipment and storage medium
WO2018157703A1 (en) Natural language semantic extraction method and device, and computer storage medium
CN110600033B (en) Learning condition evaluation method and device, storage medium and electronic equipment
CN110969012B (en) Text error correction method and device, storage medium and electronic equipment
CN110647636A (en) Interaction method, interaction device, terminal equipment and storage medium
CN110826441B (en) Interaction method, interaction device, terminal equipment and storage medium
CN109660865B (en) Method and device for automatically labeling videos, medium and electronic equipment
JP7394809B2 (en) Methods, devices, electronic devices, media and computer programs for processing video
CN110149265B (en) Message display method and device and computer equipment
CN111651497B (en) User tag mining method and device, storage medium and electronic equipment
CN109933217B (en) Method and device for pushing sentences
CN107592255B (en) Information display method and equipment
CN110516749A (en) Model training method, method for processing video frequency, device, medium and calculating equipment
US20240061899A1 (en) Conference information query method and apparatus, storage medium, terminal device, and server
CN111930792A (en) Data resource labeling method and device, storage medium and electronic equipment
CN114841274B (en) Language model training method and device, electronic equipment and storage medium
CN110867187B (en) Voice data processing method and device, storage medium and electronic equipment
CN114974253A (en) Natural language interpretation method and device based on character image and storage medium
CN110516125B (en) Method, device and equipment for identifying abnormal character string and readable storage medium
CN116962787A (en) Interaction method, device, equipment and storage medium based on video information
CN113221514A (en) Text processing method and device, electronic equipment and storage medium
CN112699687A (en) Content cataloging method and device and electronic equipment
CN113762056A (en) Singing video recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200313