WO2021159734A1

WO2021159734A1 - Data processing method and apparatus, device, and medium

Info

Publication number: WO2021159734A1
Application number: PCT/CN2020/124256
Authority: WO
Inventors: 王锁平; 周登宇; 乔磊; 曹传兴
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-09-22
Filing date: 2020-10-28
Publication date: 2021-08-19
Also published as: CN112131365A

Abstract

Disclosed in embodiments of the present application are a data processing method, an apparatus, a device, and a medium, relating to voice processing technology in artificial intelligence, and applicable to a blockchain network. The method comprises: receiving a service handling request sent by a client; obtaining first semantic information from the first voice data and determining, on the basis of the first semantic information, second voice data matching said first semantic information; determining, from a video database, second video data matching the second voice data; combining the second video data and the second voice data to obtain synthesized data and sending the synthesized data to the client; according to second multimedia data returned by the client, performing service processing. The second multimedia data is reply data returned by the client in respect of the synthesized data. Using embodiments of the present invention improves accuracy of voice recognition and enhances user experience.

Description

Data processing method, device, equipment and medium

This application claims priority to Chinese patent application No. 202011006333.7, filed with the Chinese Patent Office on September 22, 2020 and entitled "a data processing method, device, equipment and medium", the entire content of which is incorporated by reference in this application.

Technical field

This application relates to voice processing technology in artificial intelligence, and in particular to a data processing method, device, equipment, and medium.

Background technique

Traditional business processing is generally handled manually. Manual handling cannot provide around-the-clock service, and its cost is relatively high. Therefore, handling business through video robot calls is gradually replacing manual handling and allows business to be handled anytime, anywhere.

The inventor realized that current video robot calls are generally relatively fixed: a video robot can only recognize fixed words and respond accordingly, so the recognized semantic information is insufficient and the voice recognition accuracy is low. As a result, the accuracy of video robot calls is relatively low and the user experience is poor.

Summary of the invention

The embodiments of the present application provide a data processing method, device, equipment, and medium, which can improve the accuracy of voice recognition, thereby improving the accuracy of video robot calls and improving user experience.

One aspect of the embodiments of the present application provides a data processing method, including:

Receiving a service handling request sent by the client, where the service handling request includes the first voice data and the first video data;

Acquiring first semantic information from the first voice data, and determining, according to the first semantic information, second voice data matching the first semantic information; the second voice data is reply data to the first voice data, and the first semantic information is used to reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords;

Determining second video data matching the second voice data from a video database, where the video database includes multiple video data;

Performing synthesis processing on the second video data and the second voice data to obtain synthesized data, sending the synthesized data to the client, and performing service processing according to first multimedia data returned by the client, where the first multimedia data is the client's reply data to the synthesized data.

One aspect of the embodiments of the present application provides a data processing device, including:

The data acquisition module is used to receive a business processing request sent by the client, the business processing request includes the first voice data and the first video data;

The voice matching module is used to obtain first semantic information from the first voice data, and determine, according to the first semantic information, second voice data matching the first semantic information; the second voice data is reply data to the first voice data, and the first semantic information is used to reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords;

A video matching module, configured to determine second video data matching the second voice data from a video database, and the video database includes a plurality of video data;

The service processing module is used to synthesize the second video data and the second voice data to obtain synthesized data, send the synthesized data to the client, and perform service processing according to first multimedia data returned by the client, where the first multimedia data is the client's reply data to the synthesized data.

One aspect of this application provides a computer device, including: a processor, a memory, and a network interface;

The above-mentioned processor is connected to a memory and a network interface, wherein the network interface is used to provide data communication functions, the above-mentioned memory is used to store computer programs, and the above-mentioned processor is used to call the above-mentioned computer programs to execute the following methods:

One aspect of the embodiments of the present application provides a computer-readable storage medium, the computer-readable storage medium stores a computer program, and the computer program includes program instructions that, when executed by a processor, cause the processor to perform the following method :

The embodiments of the present application enable higher accuracy of voice recognition, thereby improving the accuracy of video robot calls, making human-computer interaction more natural, thereby increasing the interest of human-computer interaction and improving user experience.

Description of the drawings

FIG. 1 is a schematic diagram of the architecture of an information processing system provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of a data processing method provided by an embodiment of the present application;

FIG. 3 is a schematic flowchart of a data processing method provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of the composition structure of a data processing device provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of the composition structure of a computer device provided by an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application will be described below in conjunction with the drawings in the embodiments of the present application.

The technical solution of this application can be applied to the fields of artificial intelligence, smart city, blockchain and/or big data technology. Optionally, the data involved in this application, such as synthetic data and/or multimedia data, can be stored in a database, or can be stored in a blockchain, which is not limited in this application.

Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.

Among them, the key technologies of speech processing technology (Speech Technology) include automatic speech recognition (Automatic Speech Recognition, ASR), speech synthesis (Text To Speech, TTS) and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and voice is expected to become one of the most promising modes of human-computer interaction.

This application involves voice processing technology in artificial intelligence, and the technical solution of this application is suitable for remote business processing scenarios such as remote face-to-face review, video return visits, and remote account opening. Voice processing technology is used to obtain first semantic information from the first voice data, determine second voice data matching the first semantic information according to the first semantic information, determine second video data matching the second voice data from the video database, perform synthesis processing on the second video data and the second voice data to obtain synthesized data, send the synthesized data to the client, and perform service processing according to the first multimedia data returned by the client. When the first semantic information is obtained from the first voice data in this application, each keyword in the text data corresponding to the first voice data is combined with the association relationship between the keywords. The first semantic information can therefore reflect the meaning of the first voice data more accurately, that is, the accuracy of voice recognition is higher, and the second voice data can reply to the first voice data more accurately. This application can be applied to fields such as smart government affairs and smart education, and is conducive to promoting the construction of smart cities.

Please refer to FIG. 1, which is a schematic diagram of the architecture of an information processing system provided by an embodiment of the present application. The system architecture includes a client 11 and a business processing server 12 corresponding to a business processing platform. The client 11 may refer to a terminal that sends a service processing request. The business processing server 12 may refer to a back-end service device that performs information processing. The information processing may include acquiring first semantic information from the first voice data, determining second voice data that matches the first semantic information according to the first semantic information, determining second video data that matches the second voice data from the video database, synthesizing the second video data and the second voice data to obtain synthesized data, and so on.

The business processing server 12 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms. When the business processing server 12 is an independent physical server, the server can perform the information processing on its own; when the business processing server 12 consists of multiple physical servers, the servers can cooperate to perform the information processing. For example, one server can obtain the first semantic information from the first voice data, another server can determine the second voice data matching the first semantic information according to the first semantic information, yet another server can synthesize the second video data and the second voice data to obtain synthesized data, and so on. The client 11 can be a computer device, including a mobile phone, a tablet computer, a notebook computer, a handheld computer, a smart speaker, a mobile internet device (MID), a POS (Point Of Sales) machine, a wearable device (such as a smart watch or smart bracelet), and so on. The number of clients 11 may be one or more. The embodiments of the present application are described with one client; for multiple clients, the same method can be applied. It should be noted that the business processing can also be performed by the client, in which case the client can perform it in the same manner as the business processing server. The following description takes business processing performed by the business processing server as an example.

In practical applications, suppose a user needs to handle a certain service. First, the user sends a service handling request to the service processing server through the client. The service handling request may include the first voice data and the first video data. The service processing server then receives the service handling request sent by the client and obtains first semantic information from the first voice data. The first semantic information may include a service identifier, for example a service name or a service code. In addition, the first semantic information may reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords. That is to say, the first semantic information is determined based on the at least two keywords and the association relationship between them. The service processing server then determines second voice data matching the first semantic information, and determines second video data matching the second voice data from the video database. For example, if the second voice data instructs the user through a certain service processing procedure, the second video data may be silent video data of a video robot simulating a human whose mouth opens and closes, whose eyes move, and who is smiling. Or, if the second voice data indicates that the user's current network connection has failed, the second video data may be silent video data in which the video robot simulates human frustration, disappointment, and the like. The service processing server synthesizes the silent second video data and the second voice data to obtain the synthesized data, and sends the synthesized data to the client. Through the client, the user sees the video robot talking with him or her. In this way, by synthesizing video data and voice data, a real-time video call can be held with the user on the client side and the user experience can be improved. Because each keyword in the text data corresponding to the first voice data is extracted when the first semantic information is obtained, and the association relationship between the keywords is also obtained, the meaning of the first voice data can be determined more accurately. That is, the accuracy of voice recognition is higher, so the second voice data replies more accurately and the accuracy and flexibility of the video robot call are improved.
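The data flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `ServiceRequest`, `SynthesizedData`, and `handle_request` are hypothetical, the three processing stages are passed in as placeholder callables, and "synthesis" is reduced to pairing the two streams rather than actual audio/video muxing.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ServiceRequest:
    first_voice: bytes   # user's recorded voice
    first_video: bytes   # user's recorded video


@dataclass
class SynthesizedData:
    video: bytes         # silent second video data
    voice: bytes         # second voice data (the reply)


def handle_request(
    request: ServiceRequest,
    extract_semantics: Callable[[bytes], str],
    match_reply_voice: Callable[[str], bytes],
    match_video: Callable[[bytes], bytes],
) -> SynthesizedData:
    """Run the stages in the order given in the description."""
    semantic_info = extract_semantics(request.first_voice)   # first semantic information
    second_voice = match_reply_voice(semantic_info)          # reply voice data
    second_video = match_video(second_voice)                 # matching silent clip
    # Placeholder for synthesis: a real system would mux audio and video.
    return SynthesizedData(video=second_video, voice=second_voice)


# Stub stages, for illustration only.
result = handle_request(
    ServiceRequest(first_voice=b"voice-in", first_video=b"video-in"),
    extract_semantics=lambda voice: "loan, purchase auto insurance",
    match_reply_voice=lambda info: f"reply to: {info}".encode(),
    match_video=lambda voice: b"silent-talking-clip",
)
print(result.voice)
```

Passing the stages as callables keeps the sketch independent of any particular ASR, matching, or TTS component, none of which the description fixes.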

Please refer to FIG. 2, which is a schematic flowchart of a data processing method provided by an embodiment of the present application. As shown in FIG. 2, the method includes:

S101: Receive a service processing request sent by a client.

Among them, the service processing request includes the first voice data and the first video data. The first voice data is obtained by the client recording the user's voice, and the first video data is obtained by the client video-recording the user's posture, facial expressions, actions, and the like. The client may refer to a terminal used by a user for service processing. The first voice data may include a service identifier, and the service identifier may include a service name, an abbreviation of the service name, a service code, etc., which uniquely indicates the service. A service here refers to a type of business that the user needs to handle, such as purchasing property insurance, applying for a bank loan, applying for a bank card, or applying for a credit card. Alternatively, the service may also be an inquiry required by the user, such as a bank card balance inquiry or a credit card limit inquiry.

Optionally, the user may send a call request through the client, the service processing server obtains the call request, establishes a call connection with the client according to the call request, and receives the service processing request sent by the client through the call connection.

Here, the service processing server can correspond to multiple video extensions, and each video extension has the same function. The call request can include a video extension number, in which case the client is assigned the video extension corresponding to that number, so as to establish the call connection with the client. If the call request does not include a video extension number, an idle video extension can be assigned to the client. Specifically, a video extension can be matched to the call request sent by the client according to the waiting time of each idle video extension. For example, the idle video extension with the longest waiting time can be matched to the client, or the idle video extension with the shortest waiting time can be matched to the client, and so on. Assigning idle video extensions to the client improves the efficiency of establishing a call connection with the client, and thereby the efficiency of subsequent business processing. The call connection includes a video connection, a voice connection, and so on. The video connection is used to obtain the video data sent by the client, and the voice connection is used to obtain the voice data sent by the client. If the call connection is a voice connection, the client can send the first video data to the service processing server in another way, for example by video transmission.
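The extension assignment described above might look like the following minimal sketch. The `VideoExtension` fields, the extension numbers, and the `assign_extension` function are assumptions made for illustration; the description does not say what happens when a requested extension is already busy, so the sketch simply returns the named extension.

```python
from dataclasses import dataclass


@dataclass
class VideoExtension:
    number: str          # extension number (hypothetical format)
    busy: bool           # True while the extension is serving a call
    idle_seconds: float  # how long the extension has been waiting idle


def assign_extension(extensions, requested_number=None, prefer_longest_idle=True):
    """Return the extension to use for a call request, or None if none fits."""
    # Case 1: the call request names an extension number.
    if requested_number is not None:
        for ext in extensions:
            if ext.number == requested_number:
                return ext  # busy handling is not specified in the description
        return None
    # Case 2: no number given, so choose among idle extensions by waiting time.
    idle = [ext for ext in extensions if not ext.busy]
    if not idle:
        return None
    pick = max if prefer_longest_idle else min
    return pick(idle, key=lambda ext: ext.idle_seconds)


pool = [
    VideoExtension("8001", busy=True, idle_seconds=0.0),
    VideoExtension("8002", busy=False, idle_seconds=42.0),
    VideoExtension("8003", busy=False, idle_seconds=7.5),
]
print(assign_extension(pool).number)                             # 8002
print(assign_extension(pool, prefer_longest_idle=False).number)  # 8003
print(assign_extension(pool, requested_number="8001").number)    # 8001
```

The `prefer_longest_idle` flag covers both policies mentioned in the text: matching the idle extension with the longest waiting time or the one with the shortest.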

In specific implementation, when the client establishes a call connection with the service processing server, the client can send a call request to the service processing server, and the call request can be verified through a network switch, a firewall, etc. For example, it can be verified whether the call request is safe, whether it carries a virus, and whether it is in a format that the service processing server can recognize. The call connection with the service processing server is established after the verification passes.

Optionally, after the client establishes a call connection with the service processing server, the service processing server may send pre-stored voice data welcoming the user to the client. Based on this voice data, the user can determine that the call connection has been established successfully and then send the service handling request. For example, the pre-stored welcome voice data can be "Hello, welcome to call, what can I do for you". Optionally, to save storage space, the welcome message can be stored in the service processing server as text data, and the service processing server converts the text data into voice data before sending it to the client. Since text data occupies less storage space than voice data, storing text data saves storage space.

S102: Acquire first semantic information from the first voice data, and determine, according to the first semantic information, second voice data that matches the first semantic information.

Wherein, the second voice data is reply data for the first voice data, and the first semantic information is used to reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords. Compared with determining the meaning of the first voice data from keywords alone, acquiring both the at least two keywords and the association relationship between them allows the meaning of the first voice data to be determined more accurately. The determined first semantic information is therefore more accurate, that is, the accuracy of speech recognition is higher, and the second voice data obtained by matching is more accurate.

Optionally, a method for obtaining the first semantic information from the first voice data may be: converting the first voice data to obtain the first text data corresponding to the first voice data; performing keyword extraction on the first text data to obtain at least two keywords; obtaining word meaning information of the at least two keywords; obtaining a word combination according to the at least two keywords, and obtaining combined word meaning information of the word combination; determining the association relationship between the at least two keywords based on the word meaning information and the combined word meaning information; and determining, according to the at least two keywords and the association relationship between the at least two keywords, the semantic information corresponding to the first text data as the first semantic information.

Here, since the first voice data is voice-type data, it can be converted into text-type data to obtain the first text data. By extracting keywords from the first text data, at least two keywords are obtained. Optionally, the business processing platform may perform word segmentation on the first text data to divide it into at least one segmented word; obtain a stop word set, which includes at least one word unrelated to the business; search the stop word set for a target word that matches the at least one segmented word; delete the target word from the at least one segmented word; and perform keyword extraction on the segmented words remaining after the target word is deleted, to obtain at least two keywords.

For example, the first text data is "I want to buy auto insurance, but I don’t have enough funds and need a loan first." Word segmentation yields 10 segmented words: "I", "want", "buy", "auto insurance", "but", "funds", "insufficient", "need", "first", and "loan". The 10 segmented words are matched against each stop word in the stop word set. If the 4 segmented words "I", "want", "but", and "need" are matched, these 4 are deleted, leaving "buy auto insurance funds insufficient first loan". Keyword extraction on the remaining text gives 6 keywords: "buy", "auto insurance", "funds", "insufficient", "first", and "loan". In specific implementation, ASR technology or other technologies can be used to convert voice data into text data, from which the keywords are then extracted.
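The stop-word filtering in this example can be sketched as follows. It is a simplified illustration: the segmented words are written out by hand in English rather than produced by a real word segmenter, `STOP_WORDS` is a hypothetical stop word set, and keyword extraction is taken to be the identity on the remaining segmented words, as it is in the example.

```python
# Hypothetical stop word set: words unrelated to the business.
STOP_WORDS = {"I", "want", "but", "need"}


def remove_stop_words(tokens, stop_words):
    """Drop every segmented word found in the stop word set, keeping order."""
    return [tok for tok in tokens if tok not in stop_words]


# Segmented words as they would come out of word segmentation (given by hand).
tokens = ["I", "want", "buy", "auto insurance", "but",
          "funds", "insufficient", "need", "first", "loan"]

keywords = remove_stop_words(tokens, STOP_WORDS)
print(keywords)
# ['buy', 'auto insurance', 'funds', 'insufficient', 'first', 'loan']
```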

After keyword extraction is performed on the first text data to obtain the 6 keywords, the word meaning information of these 6 keywords is obtained. A word combination is then obtained from the 6 keywords, for example "buy auto insurance funds, loan first", and the combined word meaning information of the word combination is obtained. The association relationship between the keywords is determined according to the word meaning information of the 6 keywords and the combined word meaning information, from which the meaning of the first text data is found to be "loan first, then buy auto insurance". Finally, the semantic information corresponding to the first text data is determined according to the keywords and the association relationship between them, and is used as the first semantic information; here the first semantic information is "loan, purchase auto insurance".

Optionally, another method for obtaining the first semantic information from the first voice data may be: recognizing the first voice data directly to obtain at least two keywords in the first voice data. For example, the first voice data is: "I want to buy auto insurance, but I don’t have enough funds and need a loan first." The at least two keywords identified include "buy", "auto insurance", "funds", "insufficient", "first", and "loan", and the first semantic information obtained is "loan, purchase auto insurance". Depending on specific requirements, the first voice data may either be converted into text data for keyword extraction, or be recognized directly by voice recognition to obtain the at least two keywords. For example, if voice recognition costs less, voice recognition can be used to save costs; or, if converting the voice data into text data for keyword extraction is more accurate, that approach can be used when recognition accuracy is the priority.

Optionally, the method for determining second voice data matching the first semantic information according to the first semantic information may include several steps:

1. Obtain the similarity between at least two keywords and multiple text data in the corpus to obtain multiple first similarities.

Here, the corpus is a database corresponding to the business processing server. The corpus can contain text data related to business processing, such as specific process information for business processing; it can also include text data unrelated to business processing. In specific implementation, a similarity calculation method can be used to calculate the similarity between each keyword and each text data in the corpus, obtaining multiple first similarities. The similarity calculation method can be the Pearson correlation coefficient method, the cosine similarity method, etc., which is not limited here.

2. The text data corresponding to the association relationship is determined according to the association relationship between at least two keywords, and the similarity between the text data corresponding to the association relationship and the multiple text data is obtained, and multiple second similarities are obtained.

Here, the association relationship between the at least two keywords refers to how the keywords relate to one another; for example, it can indicate the order of the businesses corresponding to the keywords. The text data corresponding to the association relationship can therefore be determined from it. For example, suppose the at least two keywords are "buy", "auto insurance", "funds", "insufficient", "first", and "loan". Without considering the relationship between the keywords, the text data describing the business is "purchase auto insurance, loan". When the relationship between the keywords is considered, the text data corresponding to the association relationship is "loan, purchase auto insurance", that is, the loan business is handled first and the auto insurance purchase is handled afterwards. In specific implementation, a similarity calculation method can be used to calculate the similarity between the text data corresponding to the association relationship and the multiple text data in the corpus, obtaining multiple second similarities. The similarity calculation method can be the Pearson correlation coefficient method, the cosine similarity method, etc., which is not limited here.

3. Determine the text data corresponding to the maximum similarity among the plurality of first similarities as the first target text data, and determine the text data corresponding to the maximum similarity among the plurality of second similarities as the second target text data.

Here, since there are multiple text data in the corpus, calculating the similarity between each of the at least two keywords and each text data in the corpus gives one first similarity per pair, so multiple first similarities are obtained from the at least two keywords and the multiple text data. For example, if the number of keywords is n1 and the number of text data in the corpus is m1, then n1*m1 first similarities can be calculated. Correspondingly, if the number of text data corresponding to the association relationship is n2 and the number of text data in the corpus is m1, then n2*m1 second similarities can be calculated. The n1*m1 first similarities are then compared to determine the maximum among them, and the n2*m1 second similarities are compared to determine the maximum among them. The text data corresponding to the maximum of the first similarities is determined as the first target text data, and the text data corresponding to the maximum of the second similarities is determined as the second target text data.

4. Determine the second text data according to the first target text data and the second target text data.

Here, if the first target text data is the same as the second target text data, the first target text data can be determined as the second text data; if the first target text data is different from the second target text data, the second target text data can be determined as the second text data.

5. Convert the second text data to obtain the voice data corresponding to the second text data as the second voice data.

In specific implementation, natural language processing technology (Natural Language Processing, NLP) may be used to process the text data, obtain the first semantic information, and determine the second voice data matching the first semantic information. TTS technology or other technologies can be used to convert text data into voice data. After the second text data is converted into the second voice data, the user can obtain the second voice data through the client and reply with the first multimedia data accordingly, so that the service processing platform can handle the corresponding service according to the first multimedia data returned by the user. Converting the second text data into voice data lets the user grasp its content more intuitively. Compared with having the user read the second text data directly, this improves the efficiency with which the user takes in the reply, and thereby the efficiency of business processing.
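Steps 1 to 4 above can be sketched as follows. This is a toy illustration, not the method as implemented: it assumes a bag-of-words representation with cosine similarity (one of the measures named above), a hand-written three-entry corpus, and, for the case where the two target texts differ, that the relation-aware match is the one used. The function names are hypothetical, and step 5 (text to speech) is omitted.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def best_match(queries, corpus):
    """Score every (query, corpus text) pair; return the top corpus text and pair count."""
    scores = [(cosine_similarity(q, doc), doc) for q in queries for doc in corpus]
    return max(scores, key=lambda pair: pair[0])[1], len(scores)


def choose_second_text(keywords, relation_texts, corpus):
    first_target, _ = best_match(keywords, corpus)         # from n1*m1 first similarities
    second_target, _ = best_match(relation_texts, corpus)  # from n2*m1 second similarities
    # Assumption: when the targets differ, the relation-aware match is used.
    return first_target if first_target == second_target else second_target


# Hypothetical corpus (m1 = 3), keywords (n1 = 6), and relation text (n2 = 1).
corpus = [
    "to buy auto insurance please provide your vehicle licence",
    "to apply for a loan first and then buy auto insurance please confirm your income",
    "your credit card limit can be queried in the app",
]
keywords = ["buy", "auto insurance", "funds", "insufficient", "first", "loan"]
relation_texts = ["loan first then buy auto insurance"]

print(best_match(keywords, corpus)[1])        # 18 first similarities (6 * 3)
print(best_match(relation_texts, corpus)[1])  # 3 second similarities (1 * 3)
print(choose_second_text(keywords, relation_texts, corpus))
```

In this toy data the keyword-only match picks the plain auto insurance entry, while the relation-aware text picks the entry that handles the loan first, which illustrates why the association relationship is scored separately.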

S103: Determine second video data matching the second voice data from the video database.

Here, the video database includes multiple video data. For example, it can include multiple types of video data: silent video data in which a video robot simulates a person whose mouth opens and closes, whose eyes move, and who is smiling; silent video data in which a video robot simulates human frustration, disappointment, and the like; or silent video data in which a video robot simulates human expressions such as guilt. In a specific implementation, the multiple types of video data can be stored in the video database in advance for subsequent use.

Optionally, the method for determining the second video data matching the second voice data from the video database may be: obtaining the semantic scene of the second text data and the application scene of each video data in the video database; determining the target application scene matching the semantic scene; and determining the video data corresponding to the target application scene as the second video data.

Here, the second text data is the text data corresponding to the second voice data. That is, the second text data and the second voice data have the same meaning but take different forms: the second text data is in text form, and the second voice data is in sound form. The semantic scene of the second text data refers to the meaning of the second text data. For example, it can be a specific business processing procedure, instruction information that instructs the user to wait, or fault prompt information, where faults can include, for example, a network connection failure or a busy server. The application scene of video data can be determined according to the type of the video data. For example, if the semantic scene is a specific business processing procedure, the target application scene matching the semantic scene may be silent video data in which the video robot simulates a person opening and closing the mouth, moving the eyes, and smiling. If the semantic scene is instruction information that instructs the user to wait, the target application scene matching the semantic scene may be silent video data in which the video robot simulates human expressions such as guilt. If the semantic scene is fault prompt information, the target application scene matching the semantic scene may be silent video data in which the video robot simulates human frustration, disappointment, and the like. By determining the target application scene matching the semantic scene and taking the corresponding video data as the second video data, the synthesized data subsequently seen by the user appears more natural, which makes human-computer interaction more engaging.
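The scene matching described above can be sketched as a simple two-step lookup. This is a minimal illustration only: the scene labels, the clip file names, and the `pick_video` helper are hypothetical and do not come from the application, which does not specify how scenes are represented.

```python
from typing import Optional

# Hypothetical mapping from the semantic scene of the second text data to the
# application scene of a silent clip (labels are illustrative assumptions).
SCENE_TO_APPLICATION = {
    "business_process": "talking_with_smile",
    "wait_instruction": "apologetic",
    "fault_prompt": "frustrated",
}

# Hypothetical video database: application scene -> pre-stored silent clip.
VIDEO_DATABASE = {
    "talking_with_smile": "clips/talking_smile.mp4",
    "apologetic": "clips/apologetic.mp4",
    "frustrated": "clips/frustrated.mp4",
}


def pick_video(semantic_scene: str) -> Optional[str]:
    """Return the clip whose application scene matches the semantic scene."""
    target_scene = SCENE_TO_APPLICATION.get(semantic_scene)
    if target_scene is None:
        return None  # this sketch has no fallback for unknown scenes
    return VIDEO_DATABASE.get(target_scene)
```

For instance, `pick_video("fault_prompt")` selects the clip stored under the frustrated expression, mirroring the fault prompt example in the text.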

S104: Perform synthesis processing on the second video data and the second voice data to obtain synthesized data, send the synthesized data to the client, and perform service processing according to the first multimedia data returned by the client.

Here, the first multimedia data is the reply data of the client to the synthesized data. The service processing server sends the synthesized data to the client. After learning the content of the synthesized data through the client, the user responds to it, for example by answering questions in the synthesized data, filling in identity information according to the prompt information in the synthesized data, or uploading the corresponding identity documents. The client obtains the first multimedia data by collecting the user's reply to the synthesized data, for example by recording the user's reply voice and recording the user's actions and expressions. The client sends the first multimedia data to the service processing server, and the service processing server performs service processing according to the first multimedia data.

Optionally, the synthesis processing on the second video data and the second voice data may be performed as follows: obtain the voice duration of the second voice data; intercept, from the second video data, video data whose duration equals the voice duration as the candidate video data; and perform synthesis processing on the second voice data and the candidate video data to obtain the synthesized data.

Here, if the video duration of the second video data is equal to the voice duration of the second voice data, the second video data is determined as the candidate video data. If the video duration of the second video data is greater than the voice duration of the second voice data, video data equal in duration to the second voice data may be intercepted from the second video data as the candidate video data. For example, if the video duration of the second video data is 5 seconds and the voice duration of the second voice data is 3 seconds, 3 seconds of video data can be taken from the 5-second second video data as the candidate video data. If the video duration of the second video data is less than the voice duration of the second voice data, multiple copies of the second video data may be connected to obtain video data equal in duration to the second voice data as the candidate video data. For example, if the video duration of the second video data is 3 seconds and the voice duration of the second voice data is 6 seconds, two copies of the same 3-second second video data can be connected to obtain 6 seconds of video data as the candidate video data. After the second voice data and the candidate video data are synthesized, video data with sound may be obtained as the synthesized data, where the sound of the synthesized data is the second voice data. Alternatively, the second voice data and the candidate video data are sent to the client together and played by the client simultaneously, in which case the synthesized data includes the second voice data and the candidate video data. It can be seen that when the service processing server sends voice data to the client, it determines the video data matching the voice data from the video database and then sends synthesized data obtained from the voice data and the video data to the client, so that the dialogue with the client makes human-computer interaction more natural.
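The three duration cases above (equal, longer, shorter) can be sketched as a small planning function that returns which segments of the second video data to concatenate. This is a sketch under stated assumptions: the `plan_candidate_video` name and the segment representation are hypothetical, and trimming the last repeated copy when the voice duration is not an exact multiple of the video duration is an assumption, since the text only gives the exact-multiple example.

```python
from typing import List, Tuple

# (start_seconds, end_seconds) within the second video data
Segment = Tuple[float, float]


def plan_candidate_video(video_duration: float, voice_duration: float) -> List[Segment]:
    """Plan segments whose total length equals the voice duration.

    Longer clip: a single trimmed segment. Shorter clip: repeated copies,
    with the last copy trimmed if needed (an assumption of this sketch).
    """
    if video_duration <= 0 or voice_duration <= 0:
        raise ValueError("durations must be positive")
    segments: List[Segment] = []
    remaining = voice_duration
    while remaining > 1e-9:  # small tolerance for floating-point remainders
        take = min(video_duration, remaining)
        segments.append((0.0, take))
        remaining -= take
    return segments
```

With the figures from the text, a 5-second clip and 3-second voice yield one 3-second segment, and a 3-second clip and 6-second voice yield two full copies of the clip.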

In this embodiment of the application, a service processing request sent by the client is received, where the service processing request includes the first voice data and the first video data; the first semantic information is obtained from the first voice data, and the second voice data matching the first semantic information is determined according to the first semantic information. Since the first semantic information reflects at least two keywords in the first text data corresponding to the first voice data, as well as the association relationship between the at least two keywords, the obtained first semantic information represents the meaning of the first voice data more accurately. In other words, the accuracy of voice recognition is higher, so the matched second voice data is more accurate and the accuracy of the video robot call can be improved. The second video data matching the second voice data is determined from the video database, the second video data and the second voice data are synthesized to obtain the synthesized data, the synthesized data is sent to the client, and service processing is performed according to the first multimedia data returned by the client. By processing and replying to the voice data sent by users in real time, the efficiency of business processing can be improved. By synthesizing the second voice data and the second video data and sending the result to the client, the user can see, through the client, the video robot making a video call with the user, which makes human-computer interaction more natural and more engaging and improves the user experience.

Optionally, please refer to FIG. 3, which is a schematic flowchart of a data processing method provided by an embodiment of the present application. As shown in Figure 3, the method includes:

S201: Receive a service processing request sent by a client.

S202: Acquire first semantic information from the first voice data, and determine, according to the first semantic information, second voice data that matches the first semantic information.

S203: Determine second video data matching the second voice data from the video database.

S204: Perform synthesis processing on the second video data and the second voice data to obtain synthesized data, and send the synthesized data to the client.

Here, for the specific implementation of steps S201 to S204, reference may be made to the description of steps S101 to S104 in the embodiment corresponding to FIG. 2, which will not be repeated here.

S205: Intercept the first image of the user corresponding to the client from the first video data.

Here, the first video data is obtained by the client recording video of the user's posture, expressions, and actions, so the first video data includes the facial image of the user. The service processing server may capture a frame from the first video data at a preset interval to obtain a first image containing the user's face, that is, the first image of the user corresponding to the client. For example, an image may be captured from the first video data every 0.5 seconds to obtain the first image; if the duration of the first video data is 2 seconds, 4 first images of the user are acquired.
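The sampling arithmetic in this example can be sketched as follows. The `sample_timestamps` helper is hypothetical, and starting the sampling at 0 seconds is an assumption; the text only fixes the interval (0.5 seconds) and the resulting count (4 images for a 2-second video).

```python
import math
from typing import List


def sample_timestamps(duration: float, interval: float = 0.5) -> List[float]:
    """Return the timestamps (in seconds) at which frames are captured."""
    if duration < 0 or interval <= 0:
        raise ValueError("duration must be non-negative and interval positive")
    # Number of whole intervals that fit in the video; the small tolerance
    # guards against floating-point division landing just under an integer.
    count = math.floor(duration / interval + 1e-9)
    return [i * interval for i in range(count)]
```

For a 2-second video sampled every 0.5 seconds this produces four timestamps, matching the count given in the text.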

S206: Verify the legitimacy of the client according to the first image.

Here, verifying the legitimacy of the client according to the first image may refer to verifying whether the facial image of the user in the first image and the user image stored by the service processing server are facial images of the same user. If so, the client is determined to be legitimate. If not, the client is determined to be not legitimate, and adjustment information instructing the user to adjust posture is sent to the client; the second multimedia data sent by the client in response to the adjustment information is obtained; the second image of the user is intercepted from the third video data; and the legitimacy of the client is verified according to the second image.

Here, the second multimedia data includes the third video data, and the facial information of the user stored by the service processing server may be facial information that the user left when handling historical business through the service processing server. For example, if the user has applied for a bank card through the service processing server, the stored facial information may be the facial information the user provided when applying for the bank card. If the user has not handled historical business on the service processing server, or did not store facial information when doing so, the user's facial information can be obtained from other servers that store it, for example from the servers of institutions such as the Ministry of Public Security or the Ministry of Civil Affairs. When the service processing server verifies that the user's facial image in the first image and the user image stored by the service processing server are not facial images of the same user, it determines that the client is not legitimate and sends adjustment information instructing the user to adjust posture to the client, so that the user can adjust posture accordingly. For example, if the user's face is not aligned with the client's camera, then after the adjustment the user's face is aligned with the camera; or, if the camera's field of view includes user A and user B and user A is the user who needs to handle business, then after the adjustment the field of view includes only user A.

Specifically, the service processing server obtains the second multimedia data sent by the client in response to the adjustment information, where the second multimedia data includes the third video data; intercepts the user's second image from the third video data; and verifies the legitimacy of the client according to the second image. The second image includes the user's facial image. If the second image and the user's facial image stored in the service processing server are facial images of the same user, the client is legitimate, and service processing is performed according to the first multimedia data returned by the client. If the second image and the user's facial image stored in the service processing server are not facial images of the same user, the client is not legitimate; the server outputs prompt information instructing the user to handle the business at the manual service counter corresponding to the service processing server, and ends the service processing.
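The two-stage check described above (first image, then one retry after the posture adjustment) can be sketched as a small control flow. All names here are hypothetical: `matches_stored_face` stands in for whatever face comparison the server uses, `request_adjusted_images` stands in for sending the adjustment information and intercepting second images from the third video data, and accepting the client when any sampled image matches is an assumption of this sketch.

```python
from enum import Enum
from typing import Callable, Iterable


class Outcome(Enum):
    PROCESS_SERVICE = "process_service"   # client is legitimate
    REFER_TO_MANUAL = "refer_to_manual"   # end processing, direct user to manual counter


def verify_client(
    first_images: Iterable[str],
    matches_stored_face: Callable[[str], bool],
    request_adjusted_images: Callable[[], Iterable[str]],
) -> Outcome:
    """Verify with the first images, then retry once with the second images."""
    if any(matches_stored_face(image) for image in first_images):
        return Outcome.PROCESS_SERVICE
    # First check failed: ask the user to adjust posture and try again.
    second_images = request_adjusted_images()
    if any(matches_stored_face(image) for image in second_images):
        return Outcome.PROCESS_SERVICE
    return Outcome.REFER_TO_MANUAL
```

Passing the comparison and the retry as callables keeps the sketch independent of any particular face recognition library, which the application does not name.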

S207: If the client is legitimate, perform service processing according to the first multimedia data returned by the client.

Here, if the client is legitimate, for example when the second image and the user's facial image stored in the service processing server are facial images of the same user, service processing is performed according to the first multimedia data returned by the client.

In a possible implementation, before performing service processing, the service processing server may also perform a secondary verification of the legitimacy of the client according to the first multimedia data. For example, it may obtain from the first multimedia data a third image of the user answering questions: if there are 3 questions in the synthesized data, images captured while the user answers the 3 questions are intercepted from the first multimedia data to obtain third images containing the user's facial image. By performing micro-expression recognition on the third image, the authenticity of the user's answers is determined according to the user's micro-expressions while answering. If micro-expression recognition indicates that the authenticity of the answers is high, the service is processed. If it indicates that the authenticity is low, instruction information for re-verifying the user's identity is sent, or the question for which the user's micro-expression was abnormal is output again. If the secondary verification is passed, or the user's facial expression when answering the question again indicates that the authenticity of the answer is high, the service is processed. If the secondary verification fails, or the user's facial expression when answering again indicates that the authenticity is low, prompt information instructing the user to handle the business at the manual service counter corresponding to the service processing server is output, and the service processing ends. Micro-expression recognition can thus be used to verify the authenticity of what the user says, improving the accuracy of data recognition.

By acquiring the first image from the first video data and sending it to the service processing server for verification, the reliability of user identity verification can be improved. By performing micro-expression recognition on the third image in the first multimedia data, the service processing server can assess the authenticity of the user's answers, thereby realizing a secondary verification of the user's identity information and improving the accuracy of business processing.

In this embodiment of the application, the legitimacy of the client is verified before business processing is performed, that is, the user's identity information is verified. If the client is verified as legitimate, the user's identity information is considered genuine and the corresponding business processing is performed. If the client is verified as not legitimate, the user is prompted to adjust posture by outputting adjustment information so that the legitimacy of the client can be verified again. This improves the reliability of user identity verification and thereby the accuracy of business processing.

The method of the embodiment of the present application is described above, and the device of the embodiment of the present application is described below.

Referring to FIG. 4, FIG. 4 is a schematic diagram of the composition structure of a data processing device provided by an embodiment of the present application. The data processing device may be a computer program (including program code) running in a computer device; for example, the data processing device is application software. The device can be used to execute the corresponding steps in the method provided in the embodiments of this application. The device 40 includes:

The data acquisition module 401 is configured to receive a service handling request sent by the client, where the service handling request includes the first voice data and the first video data;

The voice matching module 402 is configured to obtain first semantic information from the first voice data, and determine second voice data matching the first semantic information according to the first semantic information; the second voice data is reply data for the first voice data, and the first semantic information is used to reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords;

The video matching module 403 is configured to determine second video data matching the second voice data from a video database, and the video database includes multiple video data;

The service processing module 404 is configured to perform synthesis processing on the second video data and the second voice data to obtain synthesized data, send the synthesized data to the client, and perform service processing according to the first multimedia data returned by the client, where the first multimedia data is the reply data of the client to the synthesized data.

Optionally, the voice matching module 402 is specifically used for:

Converting the first voice data to obtain first text data corresponding to the first voice data;

Performing keyword extraction on the first text data to obtain at least two keywords;

Obtain the word meaning information of the at least two keywords;

Obtain a word combination according to the at least two keywords, and obtain the combined word meaning information of the word combination;

Determine the association relationship between the at least two keywords according to the word meaning information and the combined word meaning information;

Determine, according to the at least two keywords and the association relationship between the at least two keywords, the semantic information corresponding to the first text data as the first semantic information.

Optionally, the voice matching module 402 is specifically used for:

Acquiring similarities between the at least two keywords and multiple text data in the corpus to obtain multiple first similarities;

Determine the text data corresponding to the association relationship according to the association relationship between the at least two keywords, obtain the similarity between the text data corresponding to the association relationship and the plurality of text data, to obtain a plurality of second similarities;

Determining the text data corresponding to the maximum similarity in the plurality of first similarities as the first target text data, and determining the text data corresponding to the maximum similarity in the plurality of second similarities as the second target text data;

Determining the second text data according to the first target text data and the second target text data;

The second text data is converted, and the voice data corresponding to the second text data is obtained as the second voice data.
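The similarity steps listed above can be sketched as two argmax selections over a corpus. This is only an illustration: the application does not specify the similarity measure in this section, so the Jaccard token overlap used here is a stand-in, and the `jaccard` and `select_targets` names are hypothetical. The sketch stops at the two target texts and does not show how they are combined into the second text data.

```python
from typing import List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-overlap similarity, used here only as a stand-in measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_targets(
    keywords: List[str], relation_text: str, corpus: List[str]
) -> Tuple[str, str]:
    """Return (first target text, second target text) from the corpus."""
    if not corpus:
        raise ValueError("corpus must not be empty")
    keyword_set = {word.lower() for word in keywords}
    relation_set = set(relation_text.lower().split())
    corpus_sets = [set(text.lower().split()) for text in corpus]
    # First similarities: keywords vs. each corpus text.
    first_similarities = [jaccard(keyword_set, tokens) for tokens in corpus_sets]
    # Second similarities: text expressing the keyword relationship vs. each corpus text.
    second_similarities = [jaccard(relation_set, tokens) for tokens in corpus_sets]
    first_target = corpus[first_similarities.index(max(first_similarities))]
    second_target = corpus[second_similarities.index(max(second_similarities))]
    return first_target, second_target
```

In practice a semantic similarity model would likely replace the token overlap; the structure of computing two similarity lists and taking the maximum of each is what the listed steps describe.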

Optionally, the video matching module 403 is specifically used for:

Acquiring the semantic scene of the second text data and the application scene of each video data in the video database;

Determine the target application scenario that matches the semantic scenario from the application scenario;

The video data corresponding to the target application scene is determined as the second video data.

Optionally, the business processing module 404 is specifically used for:

Acquiring the voice duration of the second voice data;

Intercepting video data equal to the voice duration from the second video data as candidate video data;

Perform synthesis processing on the second voice data and the candidate video data to obtain the synthesized data.

Optionally, the device 40 further includes: a legality verification module 405, configured to:

Intercept the first image of the user corresponding to the client from the first video data;

Verify the legitimacy of the client according to the first image;

If the client has legitimacy, the step of performing service processing according to the first multimedia data returned by the client is executed.

Optionally, the device 40 further includes: an information adjustment module 406, configured to:

If the client does not have legitimacy, sending adjustment information for instructing the user to adjust the posture to the client;

Acquiring second multimedia data sent by the client for the adjustment information, where the second multimedia data includes third video data;

Intercept the second image of the user from the third video data;

Verify the legitimacy of the client according to the second image.

It should be noted that, for content not mentioned in the embodiment corresponding to FIG. 4, please refer to the description of the method embodiment, which will not be repeated here.

Referring to FIG. 5, FIG. 5 is a schematic diagram of the composition structure of a computer device provided by an embodiment of the present application. As shown in FIG. 5, the computer device 50 may include: a processor 501, a network interface 504, and a memory 505. In addition, the computer device 50 may also include: a user interface 503 and at least one communication bus 502. The communication bus 502 is used to implement connection and communication between these components. The user interface 503 may include a display screen (Display) and a keyboard (Keyboard); optionally, the user interface 503 may also include a standard wired interface and a wireless interface. The network interface 504 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 505 may be a high-speed RAM memory, or a non-volatile memory, such as at least one disk memory. Optionally, the memory 505 may also be at least one storage device located remotely from the aforementioned processor 501. As shown in FIG. 5, the memory 505, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.

In the computer device 50 shown in FIG. 5, the network interface 504 can provide network communication functions; the user interface 503 is mainly used to provide an input interface for the user; and the processor 501 can be used to call the device control application program stored in the memory 505 to implement the steps of the data processing method described in the foregoing embodiments.

It should be understood that the computer device 50 described in the embodiment of the present application can perform the data processing method described in the foregoing embodiments corresponding to FIG. 2 and FIG. 3, and can also implement the data processing device described in the foregoing embodiment corresponding to FIG. 4, which will not be repeated here. In addition, the description of the beneficial effects of using the same method will not be repeated.

The embodiment of the present application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, and the computer program includes program instructions that, when executed by a computer, cause the computer to execute the method described in the foregoing embodiments. The computer may be a part of the aforementioned computer device, for example, the aforementioned processor 501. As an example, the program instructions may be deployed and executed on one computer device, on multiple computer devices located in one location, or on multiple computer devices that are distributed in multiple locations and interconnected by a communication network; multiple computer devices distributed in multiple locations and interconnected through a communication network can form a blockchain network.

Optionally, the medium involved in this application, such as a computer-readable storage medium, may be non-volatile or volatile.

A person of ordinary skill in the art can understand that all or part of the processes in the foregoing method embodiments can be implemented by a computer program instructing relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the procedures of the foregoing method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.

The above-disclosed are only preferred embodiments of this application, and of course the scope of rights of this application cannot be limited by this. Therefore, equivalent changes made in accordance with the claims of this application still fall within the scope of this application.

Claims

A data processing method, which includes:

Receiving a service handling request sent by the client, where the service handling request includes the first voice data and the first video data;

Acquiring first semantic information from the first voice data, and determining second voice data matching the first semantic information according to the first semantic information; the second voice data is reply data for the first voice data, and the first semantic information is used to reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords;

Determining second video data matching the second voice data from a video database, where the video database includes multiple video data;

Performing synthesis processing on the second video data and the second voice data to obtain synthesized data, sending the synthesized data to the client, and performing service processing according to the first multimedia data returned by the client, where the first multimedia data is the reply data of the client to the synthesized data.
The method according to claim 1, wherein said obtaining first semantic information from said first speech data comprises:

Converting the first voice data to obtain first text data corresponding to the first voice data;

Performing keyword extraction on the first text data to obtain at least two keywords;

Acquiring word meaning information of the at least two keywords;

Obtaining word combinations according to the at least two keywords, and obtaining combined word meaning information of the word combinations;

Determining the association relationship between the at least two keywords according to the word meaning information and the combined word meaning information;

Determining, according to the at least two keywords and the association relationship between the at least two keywords, the semantic information corresponding to the first text data as the first semantic information.
The method according to claim 2, wherein said determining, according to said first semantic information, the second voice data matching said first semantic information comprises:

Acquiring similarities between the at least two keywords and multiple text data in the corpus to obtain multiple first similarities;

Determining the text data corresponding to the association relationship according to the association relationship between the at least two keywords, and acquiring similarities between the text data corresponding to the association relationship and the multiple text data to obtain multiple second similarities;

Determining the text data corresponding to the largest similarity among the multiple first similarities as the first target text data, and determining the text data corresponding to the largest similarity among the multiple second similarities as the second target text data;

Determining the second text data according to the first target text data and the second target text data;

Converting the second text data to obtain voice data corresponding to the second text data as the second voice data.
The method according to claim 3, wherein said determining from a video database the second video data matching the second voice data comprises:

Acquiring the semantic scene of the second text data and the application scene of each video data in the video database;

Determining a target application scenario matching the semantic scenario from the application scenario;

The video data corresponding to the target application scene is determined as the second video data.
The method according to claim 1, wherein said performing synthesis processing on said second video data and said second voice data to obtain synthesized data comprises:

Acquiring the voice duration of the second voice data;

Intercepting video data with a duration equal to the voice duration from the second video data as candidate video data;

Perform synthesis processing on the second voice data and the candidate video data to obtain the synthesized data.
The method according to claim 1, wherein the method further comprises:

Intercept the first image of the user corresponding to the client from the first video data;

Verify the legitimacy of the client according to the first image;

If the client is legitimate, executing the step of performing service processing according to the first multimedia data returned by the client.
The method according to claim 6, wherein the method further comprises:

If the client is not legal, sending adjustment information for instructing the user to adjust the posture to the client;

Acquiring second multimedia data sent by the client for the adjustment information, where the second multimedia data includes third video data;

Intercepting the second image of the user from the third video data;

Verify the legitimacy of the client according to the second image.
A data processing device, which includes:

A data acquisition module, configured to receive a service processing request sent by a client, where the service processing request includes the first voice data and the first video data;

A voice matching module, configured to obtain first semantic information from the first voice data, and determine second voice data matching the first semantic information according to the first semantic information; the second voice data is reply data for the first voice data, and the first semantic information is used to reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords;

A video matching module, configured to determine second video data matching the second voice data from a video database, the video database including a plurality of video data;

A service processing module, configured to perform synthesis processing on the second video data and the second voice data to obtain synthesized data, send the synthesized data to the client, and perform service processing according to the second multimedia data returned by the client, where the second multimedia data is the reply data of the client to the synthesized data.
A computer device, which includes: a processor, a memory, and a network interface;

The processor is connected to the memory and the network interface, wherein the network interface is used to provide a data communication function, the memory is used to store program code, and the processor is used to call the program code to execute the following method:

Receiving a service handling request sent by the client, where the service handling request includes the first voice data and the first video data;

Acquiring first semantic information from the first voice data, and determining, according to the first semantic information, second voice data matching the first semantic information; the second voice data is reply data for the first voice data, and the first semantic information reflects at least two keywords in the first text data corresponding to the first voice data and the association relationship between the at least two keywords;

Determining second video data matching the second voice data from a video database, where the video database includes multiple video data;

Performing synthesis processing on the second video data and the second voice data to obtain synthesized data, sending the synthesized data to the client, and performing service processing according to first multimedia data returned by the client, where the first multimedia data is the reply data of the client to the synthesized data.
The computer device according to claim 9, wherein when the first semantic information is obtained from the first voice data, the following is specifically executed:

Converting the first voice data to obtain first text data corresponding to the first voice data;

Performing keyword extraction on the first text data to obtain at least two keywords;

Acquiring word meaning information of the at least two keywords;

Obtaining word combinations according to the at least two keywords, and obtaining combined word meaning information of the word combinations;

Determining the association relationship between the at least two keywords according to the word meaning information and the combined word meaning information;

Determining, according to the at least two keywords and the association relationship between the at least two keywords, the semantic information corresponding to the first text data as the first semantic information.
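As a non-limiting illustration, the steps above can be sketched as follows. The toy lexicons `WORD_MEANINGS` and `COMBINED_MEANINGS`, the function names, and the rule that a keyword pair is associated when its combination has a combined meaning are all assumptions made for this sketch; the claims do not specify them, and speech-to-text conversion is assumed to have already produced the text.

```python
from itertools import combinations

# Hypothetical toy lexicons; a real system would likely use a trained model or dictionary.
WORD_MEANINGS = {
    "loan": "borrowed money",
    "repay": "pay back",
    "date": "calendar day",
}
COMBINED_MEANINGS = {
    ("date", "repay"): "deadline for paying back",
    ("loan", "repay"): "paying back borrowed money",
}


def extract_keywords(text):
    """Keep words found in the lexicon, in first-seen order, without duplicates."""
    keywords = []
    for token in text.lower().replace("?", " ").replace(".", " ").split():
        if token in WORD_MEANINGS and token not in keywords:
            keywords.append(token)
    return keywords


def semantic_info(text):
    """Return keywords, their word meanings, and the associated keyword pairs."""
    keywords = extract_keywords(text)
    if len(keywords) < 2:
        raise ValueError("at least two keywords are required")
    meanings = {k: WORD_MEANINGS[k] for k in keywords}
    relations = {}
    # Word combinations: every unordered pair of keywords.
    for pair in combinations(sorted(keywords), 2):
        combined = COMBINED_MEANINGS.get(pair)
        if combined is not None:
            # Assumed rule: a pair is associated when its combination carries a meaning.
            relations[pair] = {
                "parts": (meanings[pair[0]], meanings[pair[1]]),
                "combined": combined,
            }
    return {"keywords": keywords, "meanings": meanings, "relations": relations}
```

Calling `semantic_info("What is the date to repay my loan?")` yields three keywords and two associated pairs under these toy lexicons.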
The computer device according to claim 10, wherein when the second voice data matching the first semantic information is determined according to the first semantic information, the following is specifically executed:

Acquiring similarities between the at least two keywords and multiple text data in the corpus to obtain multiple first similarities;

Determining text data corresponding to the association relationship between the at least two keywords, and acquiring similarities between the text data corresponding to the association relationship and the multiple text data to obtain multiple second similarities;

Determining the text data corresponding to the largest similarity among the multiple first similarities as the first target text data, and determining the text data corresponding to the largest similarity among the multiple second similarities as the second target text data;

Determining the second text data according to the first target text data and the second target text data;

Converting the second text data to obtain voice data corresponding to the second text data as the second voice data.
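One possible reading of the matching steps above is sketched below. The similarity measure (Jaccard overlap of word sets), the rule for merging the two target texts, and all function names are assumptions of this sketch; the claims leave them open. The final text-to-speech conversion is omitted.

```python
def tokens(text):
    """Lowercase a sentence and split it into words, ignoring basic punctuation."""
    return text.lower().replace(".", " ").replace(",", " ").split()


def jaccard(a, b):
    """Jaccard similarity of two word collections (assumed measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(query_tokens, corpus):
    """Return the corpus text with the largest similarity to the query."""
    return max(corpus, key=lambda text: jaccard(query_tokens, tokens(text)))


def select_reply_text(keywords, relation_text, corpus):
    # First similarities: keywords against every corpus text.
    first_target = best_match(keywords, corpus)
    # Second similarities: the association-relationship text against every corpus text.
    second_target = best_match(tokens(relation_text), corpus)
    # Assumed merge rule: use the shared text, otherwise join both targets.
    if first_target == second_target:
        return first_target
    return first_target + " " + second_target
```

The returned string plays the role of the second text data; a speech synthesis step would then turn it into the second voice data.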
The computer device according to claim 11, wherein when the second video data matching the second voice data is determined from the video database, the following is specifically executed:

Acquiring the semantic scene of the second text data and the application scene of each video data in the video database;

Determining, from the application scenes, a target application scene matching the semantic scene;

The video data corresponding to the target application scene is determined as the second video data.
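The scene matching above might be sketched as follows. The scene labels, the keyword rules in `SCENE_KEYWORDS`, and the layout of `VIDEO_DATABASE` are hypothetical; the claims do not say how a semantic scene is derived or how application scenes are stored.

```python
# Hypothetical keyword rules for deriving the semantic scene of a reply text.
SCENE_KEYWORDS = {
    "greeting": {"hello", "welcome"},
    "repayment": {"repay", "loan", "date"},
    "identity_check": {"face", "camera"},
}

# Hypothetical video database: each entry is tagged with its application scene.
VIDEO_DATABASE = [
    {"id": "v1", "scene": "greeting"},
    {"id": "v2", "scene": "repayment"},
    {"id": "v3", "scene": "identity_check"},
]


def semantic_scene(text):
    """Pick the scene whose keyword set overlaps the text the most."""
    words = set(text.lower().replace(".", " ").replace(",", " ").split())
    return max(SCENE_KEYWORDS, key=lambda scene: len(SCENE_KEYWORDS[scene] & words))


def match_video(text, database):
    """Return the first video whose application scene equals the semantic scene."""
    scene = semantic_scene(text)
    for video in database:
        if video["scene"] == scene:
            return video
    return None
```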
The computer device according to claim 9, wherein the processor is further configured to execute:

Intercept the first image of the user corresponding to the client from the first video data;

Verify the legitimacy of the client according to the first image;

If the client is legal, executing the step of performing service processing according to the first multimedia data returned by the client.
The computer device according to claim 13, wherein the processor is further configured to execute:

If the client is not legal, sending, to the client, adjustment information for instructing the user to adjust posture;

Acquiring second multimedia data sent by the client for the adjustment information, where the second multimedia data includes third video data;

Intercepting the second image of the user from the third video data;

Verify the legitimacy of the client according to the second image.
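The two-stage legitimacy check described in the preceding claims can be illustrated with the sketch below. The callables `verify` and `request_adjusted_video`, the frame-grabbing rule, and the wording of the adjustment message are placeholders assumed for illustration; the claims do not fix how an image is verified or how the client is contacted.

```python
def verify_with_retry(first_video, verify, request_adjusted_video,
                      grab_frame=lambda video: video[0]):
    """Verify the client from the first video; on failure, ask for a posture
    adjustment and verify once more from the newly returned video."""
    # Intercept the first image of the user from the first video data.
    first_image = grab_frame(first_video)
    if verify(first_image):
        return True
    # Client not legal: send adjustment information and wait for the reply video.
    third_video = request_adjusted_video("Please face the camera directly.")
    # Intercept the second image from the third video data and verify again.
    second_image = grab_frame(third_video)
    return verify(second_image)
```

In this sketch a video is modeled as a list of frames, and `verify` could wrap any face comparison routine; service processing would proceed only when the function returns `True`.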
A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program includes program instructions that, when executed by a processor, cause the processor to perform the following method:

Receiving a service handling request sent by the client, where the service handling request includes the first voice data and the first video data;

Acquiring first semantic information from the first voice data, and determining, according to the first semantic information, second voice data matching the first semantic information; the second voice data is reply data for the first voice data, and the first semantic information reflects at least two keywords in the first text data corresponding to the first voice data and the association relationship between the at least two keywords;

Determining second video data matching the second voice data from a video database, where the video database includes multiple video data;

Performing synthesis processing on the second video data and the second voice data to obtain synthesized data, sending the synthesized data to the client, and performing service processing according to first multimedia data returned by the client, where the first multimedia data is the reply data of the client to the synthesized data.
The computer-readable storage medium according to claim 15, wherein when the first semantic information is obtained from the first voice data, the following is specifically executed:

Converting the first voice data to obtain first text data corresponding to the first voice data;

Performing keyword extraction on the first text data to obtain at least two keywords;

Acquiring word meaning information of the at least two keywords;

Obtaining word combinations according to the at least two keywords, and obtaining combined word meaning information of the word combinations;

Determining the association relationship between the at least two keywords according to the word meaning information and the combined word meaning information;

Determining, according to the at least two keywords and the association relationship between the at least two keywords, the semantic information corresponding to the first text data as the first semantic information.
The computer-readable storage medium according to claim 16, wherein when the second voice data matching the first semantic information is determined according to the first semantic information, the following is specifically executed:

Acquiring similarities between the at least two keywords and multiple text data in the corpus to obtain multiple first similarities;

Determining text data corresponding to the association relationship between the at least two keywords, and acquiring similarities between the text data corresponding to the association relationship and the multiple text data to obtain multiple second similarities;

Determining the text data corresponding to the largest similarity among the multiple first similarities as the first target text data, and determining the text data corresponding to the largest similarity among the multiple second similarities as the second target text data;

Determining the second text data according to the first target text data and the second target text data;

Converting the second text data to obtain voice data corresponding to the second text data as the second voice data.
The computer-readable storage medium according to claim 17, wherein when the second video data matching the second voice data is determined from the video database, the following is specifically executed:

Acquiring the semantic scene of the second text data and the application scene of each video data in the video database;

Determining, from the application scenes, a target application scene matching the semantic scene;

The video data corresponding to the target application scene is determined as the second video data.
The computer-readable storage medium according to claim 15, wherein the program instructions, when executed by the processor, are also used to cause the processor to execute:

Intercept the first image of the user corresponding to the client from the first video data;

Verify the legitimacy of the client according to the first image;

If the client is legal, executing the step of performing service processing according to the first multimedia data returned by the client.
The computer-readable storage medium according to claim 19, wherein the program instructions when executed by the processor are also used to cause the processor to execute:

If the client is not legal, sending, to the client, adjustment information for instructing the user to adjust posture;

Acquiring second multimedia data sent by the client for the adjustment information, where the second multimedia data includes third video data;

Intercepting the second image of the user from the third video data;

Verify the legitimacy of the client according to the second image.