WO2021159734A1 - Data processing method and apparatus, device, and medium - Google Patents

Data processing method and apparatus, device, and medium

Info

Publication number
WO2021159734A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
client
video
voice
voice data
Prior art date
Application number
PCT/CN2020/124256
Other languages
French (fr)
Chinese (zh)
Inventor
王锁平
周登宇
乔磊
曹传兴
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021159734A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3343 Query execution using phonetics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/738 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Definitions

  • This application relates to voice processing technology in artificial intelligence, and in particular to a data processing method, device, equipment, and medium.
  • the embodiments of the present application provide a data processing method, device, equipment, and medium, which can improve the accuracy of voice recognition, thereby improving the accuracy of video robot calls and improving user experience.
  • One aspect of the embodiments of the present application provides a data processing method, including:
  • the second voice data is reply data to the first voice data
  • the first semantic information is used to reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords
  • the first multimedia data is the reply data of the client to the synthesized data.
  • One aspect of the embodiments of the present application provides a data processing device, including:
  • the data acquisition module is used to receive a business processing request sent by the client, the business processing request includes the first voice data and the first video data;
  • the voice matching module is used to obtain first semantic information from the first voice data, and to determine second voice data matching the first semantic information according to the first semantic information; the second voice data is reply data for the first voice data, and the first semantic information is used to reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords;
  • a video matching module configured to determine second video data matching the second voice data from a video database, and the video database includes a plurality of video data;
  • the service processing module is used to synthesize the second video data and the second voice data to obtain synthesized data, send the synthesized data to the client, and perform service processing according to the first multimedia data returned by the client; the first multimedia data is the reply data of the client to the synthesized data.
  • One aspect of this application provides a computer device, including: a processor, a memory, and a network interface;
  • the above-mentioned processor is connected to a memory and a network interface, wherein the network interface is used to provide data communication functions, the above-mentioned memory is used to store computer programs, and the above-mentioned processor is used to call the above-mentioned computer programs to execute the following methods:
  • the second voice data is reply data to the first voice data
  • the first semantic information is used to reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords
  • the first multimedia data is the reply data of the client to the synthesized data.
  • One aspect of the embodiments of the present application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, and the computer program includes program instructions that, when executed by a processor, cause the processor to perform the following method:
  • the second voice data is reply data to the first voice data
  • the first semantic information is used to reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords
  • the first multimedia data is the reply data of the client to the synthesized data.
  • the embodiments of the present application enable higher accuracy of voice recognition, thereby improving the accuracy of video robot calls, making human-computer interaction more natural, thereby increasing the interest of human-computer interaction and improving user experience.
  • FIG. 1 is a schematic diagram of the architecture of an information processing system provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the composition structure of a data processing device provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the composition structure of a computer device provided by an embodiment of the present application.
  • the technical solution of this application can be applied to the fields of artificial intelligence, smart city, blockchain and/or big data technology.
  • the data involved in this application such as synthetic data and/or multimedia data, can be stored in a database, or can be stored in a blockchain, which is not limited in this application.
  • Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • Speech technology includes automatic speech recognition (Automatic Speech Recognition, ASR), speech synthesis (Text To Speech, TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and voice is expected to become one of its most promising modes.
  • This application involves voice processing technology in artificial intelligence, and the technical solution of this application is suitable for remote business processing scenarios such as remote face-to-face review, video return visits, and remote account opening.
  • Voice processing technology is used to obtain first semantic information from the first voice data, determine second voice data matching the first semantic information according to the first semantic information, determine second video data matching the second voice data from the video database, synthesize the second video data and the second voice data to obtain synthesized data, send the synthesized data to the client, and perform service processing according to the first multimedia data returned by the client. Because the first semantic information is obtained by combining each keyword in the text data corresponding to the first voice data with the association relationships between those keywords, the first semantic information can more accurately reflect the meaning of the first voice data. That is, the accuracy of voice recognition is higher, and the second voice data can reply to the first voice data more accurately.
  • This application can be applied to the fields of smart government affairs, smart education, etc., and is conducive to promoting the construction of smart cities.
  • FIG. 1 is a schematic diagram of the architecture of an information processing system provided by an embodiment of the present application.
  • the schematic diagram of the system architecture includes a client 11 and a business processing server 12 corresponding to a business processing platform.
  • the client 11 may refer to a terminal that sends a service processing request;
  • the service processing server 12 may refer to a back-end service device that performs information processing. The information processing may include acquiring first semantic information from the first voice data, determining second voice data that matches the first semantic information according to the first semantic information, determining second video data that matches the second voice data from the video database, synthesizing the second video data and the second voice data to obtain synthesized data, and so on.
  • the business processing server 12 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms.
  • When the business processing server 12 is an independent physical server, that server can perform the information processing by itself; when the business processing server 12 is multiple physical servers, the servers can cooperate to perform the information processing. For example, one server can obtain the first semantic information from the first voice data, another server can determine the second voice data matching the first semantic information according to the first semantic information, a further server can synthesize the second video data and the second voice data to obtain synthesized data, and so on.
  • the client 11 can be a computer device, including a mobile phone, a tablet computer, a notebook computer, a handheld computer, a smart speaker, a mobile internet device (MID), a POS (Point Of Sales) machine, a wearable device (such as a smart watch or smart bracelet), etc.
  • the number of clients 11 may be one or more.
  • the embodiment of the present application is described with one client. For multiple clients, this method can be referred to for processing.
  • the business processing can also be performed by the client, and the manner in which the client performs business processing can refer to the manner in which the business processing server performs business processing. The following uses the business processing server to perform business processing as an example for description.
  • a user needs to handle a certain service.
  • the service handling request may include the first voice data and the first video data.
  • the service processing server receives the service processing request sent by the client, and obtains first semantic information from the first voice data.
  • the first semantic information may include a service identifier, which may be, for example, a service name, a service code, and so on.
  • the first semantic information may reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords. That is to say, the first semantic information is determined based on the at least two keywords and the association relationship between them.
  • the service processing server determines the second video data matching the second voice data from the video database.
  • For example, if the second voice data is voice data that instructs the user to perform a certain service processing process, the second video data may be silent video data of a video robot simulating a human opening and closing the mouth, moving the eyes, and smiling.
  • If the second voice data is voice data indicating that the user's current network connection has failed, the second video data may be silent video data of the video robot simulating human frustration, disappointment, and the like.
  • the service processing server synthesizes the silent second video data and the second voice data, obtains the synthesized data, and sends the synthesized data to the client. The user can see the video robot talking with him through the client.
  • Because each keyword in the text data corresponding to the first voice data is extracted when the first semantic information is obtained from the first voice data, and the association relationship between the keywords is also obtained, the first semantic information is more accurate. That is, the accuracy of voice recognition is higher, so the second voice data replies to the first voice data more accurately, and the accuracy and flexibility of the video robot call are improved.
  • FIG. 2 is a schematic flowchart of a data processing method provided by an embodiment of the present application. As shown in FIG. 2, the method includes:
  • S101 Receive a service processing request sent by a client.
  • the service processing request includes the first voice data and the first video data.
  • the first voice data is obtained by the client recording the user's voice, and the first video data is obtained by the client video-recording the user's posture, facial expressions, actions, and the like.
  • the client may refer to a terminal used by a user for service processing.
  • the first voice data may include a service identifier, and the service identifier may include a service name, an abbreviation of the service name, a service code, etc., which are used to uniquely indicate the service.
  • Business refers to the types of services that users need to handle, such as purchasing property insurance, bank loans, bank card processing, credit card processing, and so on. Alternatively, the business may also include services required by the user, such as bank card balance inquiry, credit card limit inquiry, and so on.
  • the user may send a call request through the client, the service processing server obtains the call request, establishes a call connection with the client according to the call request, and receives the service processing request sent by the client through the call connection.
  • the service processing server can correspond to multiple video extensions, and each video extension has the same function.
  • the call request can include a video extension number, and the client is assigned the video extension corresponding to that number, thereby establishing the call connection with the client.
  • an idle video extension can be assigned to the client.
  • an idle video extension can be matched to the call request sent by the client according to the waiting time of each idle video extension. For example, the idle video extension with the longest waiting time can be matched to the client, or the idle video extension with the shortest waiting time can be matched to the client, and so on.
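The extension-matching rules above can be sketched as follows. This is a minimal illustration only: the `VideoExtension` fields, the `assign_extension` function, and the `prefer_longest_wait` flag are hypothetical names that do not appear in the application, and the application does not say what happens when no extension is available (the sketch simply returns `None`).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VideoExtension:
    number: str          # extension number (hypothetical format)
    idle: bool           # True when the extension is not in a call
    idle_seconds: float  # how long the extension has been waiting


def assign_extension(
    extensions: List[VideoExtension],
    requested_number: Optional[str] = None,
    prefer_longest_wait: bool = True,
) -> Optional[VideoExtension]:
    """Pick a video extension for an incoming call request."""
    # Case 1: the call request names a video extension number.
    if requested_number is not None:
        for ext in extensions:
            if ext.number == requested_number:
                return ext
        return None
    # Case 2: no number given, so choose among idle extensions by waiting time.
    idle = [ext for ext in extensions if ext.idle]
    if not idle:
        return None
    pick = max if prefer_longest_wait else min
    return pick(idle, key=lambda ext: ext.idle_seconds)


extensions = [
    VideoExtension("8001", idle=True, idle_seconds=12.0),
    VideoExtension("8002", idle=False, idle_seconds=0.0),
    VideoExtension("8003", idle=True, idle_seconds=40.0),
]
chosen = assign_extension(extensions)
print(chosen.number if chosen else "no idle extension")
```

Choosing the longest-waiting extension spreads calls evenly, while choosing the shortest-waiting one concentrates them; the application allows either policy.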
  • Call connection includes video connection, voice connection, and so on.
  • the video connection is used to obtain the video data sent by the client
  • the voice connection is used to obtain the voice data sent by the client.
  • the client can also send the first video data to the service processing server in other ways; for example, the first video data can be sent to the service processing server by video transmission.
  • when the client establishes a call connection with the service processing server, the client can send a call request to the service processing server, and the call request can be verified through network switching, a firewall, etc., for example by checking whether the call request is safe, whether it carries viruses, and whether it is in a format that the service processing server can recognize. The call connection with the service processing server is established after the verification passes.
  • the service processing server may send the client pre-stored voice data that welcomes the user; based on this voice data, the user can determine that the call connection has been established successfully and then send the service processing request.
  • the pre-stored voice data indicating that the user is welcome can be "Hello, welcome to call, what can I do for you".
  • the text data that welcomes the user can be stored in the service processing server, and the service processing server converts the text data into voice data and then sends the voice data to the client. Because text data occupies less storage space than voice data, storing text data saves storage space.
  • S102 Acquire first semantic information from the first voice data, and determine, according to the first semantic information, second voice data that matches the first semantic information.
  • the second voice data is reply data for the first voice data, and the first semantic information is used to reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords.
  • a method for obtaining the first semantic information from the first voice data may be: converting the first voice data to obtain the first text data corresponding to the first voice data; performing keyword extraction on the first text data to obtain at least two keywords; obtaining word meaning information of the at least two keywords; obtaining a word combination from the at least two keywords and obtaining combined word meaning information of the word combination; determining the association relationship between the at least two keywords according to the word meaning information and the combined word meaning information; and determining the semantic information corresponding to the first text data, according to the at least two keywords and the association relationship between them, as the first semantic information.
  • the voice-type data can be converted into text-type data to obtain the first text data.
  • the business processing platform may perform word segmentation on the first text data to divide it into at least one segmented word; obtain a stop word set that includes at least one word unrelated to the business; search the stop word set for target words that match the at least one segmented word; delete the target words from the at least one segmented word; and perform keyword extraction on the segmented words that remain to obtain at least two keywords.
  • the first text data is "I want to buy auto insurance, but I need a loan if I don’t have enough funds.”
  • the result of word segmentation processing is the 10 segmented words "I", "want", "buy", "auto insurance", "but", "funds", "insufficient", "need", "first", and "loan".
  • the 10 segmented words are matched against each stop word in the stop word set. If the 4 segmented words "I", "want", "but", and "need" are matched, these 4 segmented words are deleted, leaving "buy auto insurance funds insufficient first loan". Keyword extraction on the remaining words yields the 6 keywords "buy", "auto insurance", "funds", "insufficient", "first", and "loan".
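The stop-word step in this example can be sketched as below. The token list is an assumed English rendering of the 10 segmented words (the original example is in Chinese), the function name `extract_keywords` is hypothetical, and the sketch treats every word that survives stop-word deletion as a keyword, whereas the application describes keyword extraction as a further step after deletion.

```python
from typing import Iterable, List, Set


def extract_keywords(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Delete segmented words found in the stop-word set, keeping the original order."""
    return [tok for tok in tokens if tok not in stop_words]


# Assumed English rendering of the segmentation result from the example.
tokens = [
    "I", "want", "buy", "auto insurance", "but",
    "funds", "insufficient", "need", "first", "loan",
]
# Words unrelated to the business (assumed stop-word set for this example).
stop_words = {"I", "want", "but", "need"}

keywords = extract_keywords(tokens, stop_words)
print(keywords)
```

Keeping the surviving words in their original order matters here, because the later steps use the word order ("first", "loan") to infer which business is handled first.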
  • ASR technology or other technologies can be used to convert voice data into text data, thereby extracting keywords in the text data.
  • the word meaning information of each of these 6 keywords is obtained; a word combination is obtained from the 6 keywords, for example "buy auto insurance funds, loan first", and the combined word meaning information of the word combination is obtained; the association relationship between the keywords is determined according to the word meaning information of the 6 keywords and the combined word meaning information, which gives the meaning of the first text data as "loan first, then buy auto insurance"; and the semantic information corresponding to the first text data is determined, according to the at least two keywords and the association relationship between them, as the first semantic information, so the first semantic information is "loan, purchase auto insurance".
  • another method for obtaining the first semantic information from the first voice data may be: recognizing the first voice data to obtain at least two keywords in the first voice data.
  • For example, if the first voice data is "I want to buy auto insurance, but I need a loan first if I don’t have enough funds", the identified keywords include "buy", "auto insurance", "funds", "insufficient", "first", and "loan", and the first semantic information obtained is "loan, purchase auto insurance".
  • the first voice data may be converted into text data for keyword extraction according to specific requirements, or voice recognition may be performed on the first voice data to obtain at least two keywords.
  • the method for determining second voice data matching the first semantic information according to the first semantic information may include several steps:
  • the corpus is a database corresponding to the business processing server.
  • the corpus can contain text data related to business processing, such as specific process information for business processing; it can also include text data that has nothing to do with business processing.
  • a similarity calculation method can be used to calculate the similarity between each keyword and each text data in the corpus to obtain multiple first similarities. The similarity calculation method can include the Pearson correlation coefficient method, the cosine similarity method, etc., which are not limited here.
  • the text data corresponding to the association relationship is determined according to the association relationship between at least two keywords, and the similarity between the text data corresponding to the association relationship and the multiple text data is obtained, and multiple second similarities are obtained.
  • the association relationship between at least two keywords refers to how the keywords relate to one another; for example, it can indicate the order in which the businesses corresponding to the keywords are handled, so the corresponding text data can be determined according to the association relationship. For example, if the at least two keywords include "buy", "auto insurance", "funds", "insufficient", "first", and "loan", the obtained text data containing the businesses is "purchase auto insurance loan", and the text data corresponding to the association relationship between the businesses is "purchase auto insurance loan", that is, the loan business is processed first, and then the auto insurance purchase business is processed.
  • a similarity calculation method can be used to calculate the similarity between the text data corresponding to the association relationship and the multiple text data in the corpus to obtain multiple second similarities. The similarity calculation method can include the Pearson correlation coefficient method, the cosine similarity method, etc., which are not limited here.
  • the similarity between each of the at least two keywords and each of the multiple text data in the corpus is calculated to obtain a first similarity, so multiple first similarities can be obtained from the at least two keywords and the multiple text data. For example, if the number of keywords is n1 and the number of text data in the corpus is m1, then n1*m1 first similarities can be calculated. Correspondingly, if the number of text data corresponding to the association relationship is n2 and the number of text data in the corpus is m1, then n2*m1 second similarities can be calculated.
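The n1*m1 and n2*m1 counts can be made concrete with a small sketch. It uses cosine similarity over simple word-count vectors, which is only one of the methods the application permits; the whitespace tokenization, the function names, and the sample corpus entries are all assumptions introduced for illustration.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts, using word counts as vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[word] * vb[word] for word in va)
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(queries: List[str], corpus: List[str]) -> List[List[float]]:
    """One row per query and one column per corpus text: len(queries) * len(corpus) values."""
    return [[cosine_similarity(q, doc) for doc in corpus] for q in queries]


# Hypothetical corpus of m1 = 3 text data.
corpus = [
    "loan application process",
    "auto insurance purchase process",
    "weather today",
]
keywords = ["loan", "auto insurance"]            # n1 = 2 keywords
relation_texts = ["loan auto insurance"]         # n2 = 1 association-relationship text

first_similarities = similarity_matrix(keywords, corpus)         # n1 * m1 values
second_similarities = similarity_matrix(relation_texts, corpus)  # n2 * m1 values

count_first = sum(len(row) for row in first_similarities)
count_second = sum(len(row) for row in second_similarities)
print(count_first, count_second)
```

A production system would more likely compare embedding vectors than raw word counts, but the shape of the result (one similarity per query and corpus-text pair) is the same.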
  • if the first target text data is the same as the second target text data, the first target text data can be determined as the second text data; if the first target text data is different from the second target text data, the first target text data and the second target text data can be determined as the second text data.
  • natural language processing technology (Natural Language Processing, NLP)
  • TTS technology or other technologies can be used to convert text data into voice data.
  • the user can obtain the second voice data through the client and then reply with the first multimedia data according to the second voice data, so that the service processing platform processes the corresponding service according to the first multimedia data replied by the user. By converting the second text data into voice data, the user can learn the content more intuitively. Compared with having the user read the second text data directly, converting the text data into voice data lets the user take in the content more efficiently, thereby improving the efficiency of business processing.
  • S103 Determine second video data matching the second voice data from the video database.
  • the video database includes multiple video data.
  • it can include multiple types of video data, specifically silent video data such as a video robot simulating a person opening and closing the mouth, moving the eyes, and smiling, or a video robot simulating a person's frustration, disappointment, etc.
  • the multiple types of video data can be stored in the video database in advance, so as to facilitate subsequent use.
  • the method for determining the second video data matching the second voice data from the video database may be: obtaining the semantic scenario of the second text data and the application scenario of each video data in the video database; determining the target application scenario that matches the semantic scenario; and determining the video data corresponding to the target application scenario as the second video data.
  • the second text data is the text data corresponding to the second voice data. That is, the second text data and the second voice data have the same meaning but different forms: the second text data is in text form, and the second voice data is in the form of sound.
  • the semantic scenario of the second text data refers to the meaning of the second text data. For example, it can include specific business processing procedures, instruction information that asks the user to wait, fault prompt information, and so on. Faults can include, for example, a network connection failure, a busy server, and so on.
  • the application scenario of the video data can be determined according to the type of the video data.
  • For example, if the semantic scenario is a specific business processing process, the target application scenario that matches it may be that of silent video data in which the video robot simulates a human opening and closing the mouth, moving the eyes, and smiling.
  • If the semantic scenario is instruction information that asks the user to wait, the target application scenario that matches it may be that of silent video data in which the video robot simulates human expressions such as guilt.
  • If the semantic scenario is fault prompt information, the target application scenario that matches it may be that of silent video data in which the video robot simulates human frustration, disappointment, etc.
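The three scenario pairings above amount to a lookup from semantic scenario to application scenario, followed by a search of the video database. The sketch below assumes that both kinds of scenario are represented as short labels; the label strings, file names, and the `match_video` function are hypothetical, and the application does not specify how scenarios are encoded or compared.

```python
from typing import Dict, List, Optional

# Assumed mapping from the semantic scenario of the second text data
# to the application scenario of a silent video clip.
SCENE_TO_APPLICATION: Dict[str, str] = {
    "business_process": "speaking_with_smile",
    "wait_instruction": "apologetic",
    "fault_prompt": "frustrated",
}

# Hypothetical video database: each entry records its application scenario.
VIDEO_DATABASE: List[Dict[str, str]] = [
    {"file": "talk_smile.mp4", "application_scene": "speaking_with_smile"},
    {"file": "apologetic.mp4", "application_scene": "apologetic"},
    {"file": "frustrated.mp4", "application_scene": "frustrated"},
]


def match_video(
    semantic_scene: str,
    video_database: List[Dict[str, str]],
    scene_map: Dict[str, str] = SCENE_TO_APPLICATION,
) -> Optional[Dict[str, str]]:
    """Return the first video whose application scenario matches the semantic scenario."""
    target = scene_map.get(semantic_scene)
    if target is None:
        return None  # no known application scenario for this semantic scenario
    for video in video_database:
        if video["application_scene"] == target:
            return video
    return None


second_video = match_video("fault_prompt", VIDEO_DATABASE)
print(second_video["file"] if second_video else "no matching video")
```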
  • S104 Perform synthesis processing on the second video data and the second voice data to obtain synthesized data, send the synthesized data to the client, and perform service processing according to the first multimedia data returned by the client.
  • the first multimedia data is the reply data of the client to the synthesized data.
  • the business processing server sends the synthetic data to the client. After the user learns the synthetic data through the client, the user will respond according to the synthetic data, such as answering questions in the synthetic data, and filling in the user's information according to the prompt information in the synthetic data. Identity information, upload corresponding identity documents, etc.
  • the client terminal obtains the first multimedia data by collecting the data that the user replies based on the synthesized data, for example, recording the user's reply voice, and recording the user's actions and expressions.
  • the client sends the first multimedia data to the service processing server, and the service processing server performs service processing according to the first multimedia data.
  • performing synthesis processing on the second video data and the second voice data to obtain the synthesized data may be: obtaining the voice duration of the second voice data; intercepting, from the second video data, video data whose duration equals the voice duration as the candidate video data; and performing synthesis processing on the second voice data and the candidate video data to obtain the synthesized data.
  • if the video duration of the second video data is equal to the voice duration of the second voice data, the second video data is determined as the candidate video data. If the video duration of the second video data is greater than the voice duration of the second voice data, video data equal to the voice duration of the second voice data may be intercepted from the second video data as the candidate video data. For example, if the video duration of the second video data is 5 seconds and the voice duration of the second voice data is 3 seconds, 3 seconds of video data can be intercepted from the 5-second second video data as the candidate video data. If the video duration of the second video data is less than the voice duration of the second voice data, multiple copies of the second video data may be connected to obtain video data equal to the voice duration of the second voice data as the candidate video data.
  • for example, if the video duration of the second video data is 3 seconds and the voice duration of the second voice data is 6 seconds, two copies of the same 3-second second video data can be connected to obtain 6 seconds of video data as the candidate video data.
  • in one implementation, the second voice data and the candidate video data are merged to obtain video data with sound as the synthesized data, where the sound of the synthesized data is the second voice data.
  • alternatively, the second voice data and the candidate video data are sent to the client at the same time and the client plays them simultaneously; in this case the synthesized data includes the second voice data and the candidate video data.
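  • the duration alignment described above (intercept when the clip is longer than the voice, connect copies when it is shorter) can be sketched as a plan of clip segments. The function name `plan_candidate_video` is hypothetical, and trimming the last copy when the voice duration is not a whole multiple of the clip duration is an assumption; the source only gives the 5 s/3 s and 3 s/6 s examples.

```python
from typing import List, Tuple


def plan_candidate_video(video_s: float, voice_s: float) -> List[Tuple[float, float]]:
    """Return (start, end) segments of the source clip that together last voice_s seconds."""
    if video_s <= 0 or voice_s <= 0:
        raise ValueError("durations must be positive")
    segments = []
    remaining = voice_s
    while remaining > 1e-9:
        # Take the whole clip, or only the part still needed (assumed trimming rule).
        take = min(video_s, remaining)
        segments.append((0.0, take))
        remaining -= take
    return segments


# 5 s clip, 3 s voice: intercept the first 3 s.
print(plan_candidate_video(5.0, 3.0))
# 3 s clip, 6 s voice: connect two copies of the clip.
print(plan_candidate_video(3.0, 6.0))
```

  • the plan only lists which parts of the clip to use; actually cutting and concatenating the media would be done by whatever video tooling the server uses, which the source does not name.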
  • when the service processing server sends voice data to the client, it determines the video data matching the voice data from the video database and then sends synthesized data, based on the voice data and the video data, to the client.
  • the resulting dialogue with the client makes human-computer interaction more natural.
  • the service processing request includes the first voice data and the first video data; the first semantic information is obtained from the first voice data, and the second voice data that matches the first semantic information is determined according to the first semantic information. Since the first semantic information reflects at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords, the obtained first semantic information represents the meaning of the first voice data more accurately; that is, the accuracy of voice recognition is higher, so the second voice data obtained by matching is more accurate and the accuracy of the video robot call can be improved.
  • the second video data matching the second voice data is determined from the video database, the second video data and the second voice data are synthesized to obtain the synthesized data, the synthesized data is sent to the client, and service processing is performed according to the first multimedia data returned by the client, so the efficiency of business processing can be improved.
  • the user can see, through the client, the video robot making a video call with the user, which makes the human-computer interaction more natural and more engaging and improves the user experience.
  • FIG. 3 is a schematic flowchart of a data processing method provided by an embodiment of the present application. As shown in Figure 3, the method includes:
  • S201 Receive a service processing request sent by a client.
  • S202 Acquire first semantic information from the first voice data, and determine, according to the first semantic information, second voice data that matches the first semantic information.
  • S203 Determine second video data matching the second voice data from the video database.
  • S204 Perform synthesis processing on the second video data and the second voice data to obtain synthesized data, and send the synthesized data to the client.
  • for steps S201 to S204, reference may be made to the description of steps S101 to S104 in the embodiment corresponding to FIG. 2, which will not be repeated here.
  • S205 Intercept the first image of the user corresponding to the client from the first video data.
  • the first video data is obtained through video recording of the user's posture, expression, and action data by the client. It can be seen that the first video data includes the facial image of the user.
  • the service processing server may intercept the first video data every preset time to obtain the first image containing the user's face, that is, obtain the first image of the user corresponding to the client. For example, the image in the first video data may be intercepted every 0.5 seconds to obtain the first image. For example, if the duration of the first video data is 2 seconds, the number of first images of the user acquired is 4.
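  • the interval sampling in this example can be sketched as follows. The function name `sample_timestamps` is hypothetical, and capturing at the start of each interval (0.0 s, 0.5 s, ...) is an assumption; the source only fixes the interval and the resulting count of 4 images for a 2-second video.

```python
import math
from typing import List


def sample_timestamps(duration_s: float, interval_s: float = 0.5) -> List[float]:
    """Return the timestamps (in seconds) at which frames are captured."""
    if duration_s < 0 or interval_s <= 0:
        raise ValueError("duration must be >= 0 and interval must be > 0")
    # A small tolerance guards against float error, e.g. 0.3 / 0.1
    # evaluating to 2.9999999999999996.
    count = math.floor(duration_s / interval_s + 1e-9)
    # Assumption: one frame at the start of each interval.
    return [i * interval_s for i in range(count)]


# A 2-second video sampled every 0.5 seconds yields 4 capture points.
print(sample_timestamps(2.0))
```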
  • S206 Verify the legitimacy of the client according to the first image.
  • verifying the legitimacy of the client according to the first image may refer to verifying whether the facial image of the user in the first image and the user image stored by the service processing server are facial images of the same user. If so, it is determined that the client is legitimate. If not, it is determined that the client is not legitimate, and adjustment information instructing the user to adjust posture is sent to the client; the second multimedia data sent by the client in response to the adjustment information is obtained; the second image of the user is intercepted from the third video data; and the legitimacy of the client is verified according to the second image.
  • the second multimedia data includes the third video data.
  • the facial information of the user stored by the service processing server may be facial information that the user provided when previously handling business through the service processing server.
  • for example, it may be the facial information that the user provided when applying for a bank card through the service processing server.
  • alternatively, the user's facial information can be obtained from other servers that store it, for example from the servers of institutions such as the Ministry of Public Security and the Ministry of Civil Affairs.
  • if the service processing server verifies that the user's facial image in the first image and the user image stored by the service processing server are not facial images of the same user, it determines that the client is not legitimate and sends adjustment information instructing the user to adjust posture to the client, so that the user can adjust posture according to the adjustment information. For example, when the user's face is not aligned with the client's camera, the user adjusts so that the face is aligned with the camera; or, when the camera's view includes user A and user B and user A is the user who needs to handle business, the user adjusts so that the camera's view includes only user A.
  • the service processing server obtains the second multimedia data sent by the client in response to the adjustment information, where the second multimedia data includes the third video data; intercepts the user's second image from the third video data; and verifies the legitimacy of the client according to the second image.
  • the second image includes the user's facial image. If the facial image in the second image and the user's facial image stored in the service processing server are facial images of the same user, the client is legitimate, and service processing is performed according to the first multimedia data returned by the client.
  • otherwise, the client is not legitimate; information instructing the user to handle the business at the manual business processing office corresponding to the service processing server is output, and the service processing ends.
  • if the client is legitimate, the service processing is performed according to the first multimedia data returned by the client.
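  • the verification flow above (check the first images, ask for a posture adjustment once if the check fails, then either process the business or refer the user to the manual office) can be sketched as below. All names are hypothetical; the face comparison is passed in as a callable because the source does not specify a face-recognition method, and accepting the client when any sampled image matches is an assumption.

```python
from typing import Callable, Sequence


def verify_client(
    first_images: Sequence[bytes],
    request_adjusted_images: Callable[[], Sequence[bytes]],
    same_user: Callable[[bytes], bool],
) -> str:
    """Return 'process' when verification passes, otherwise 'manual'."""
    # First check: images intercepted from the first video data.
    if any(same_user(img) for img in first_images):
        return "process"
    # Check failed: send adjustment information and verify the reply once.
    second_images = request_adjusted_images()
    if any(same_user(img) for img in second_images):
        return "process"
    # Still not verified: refer the user to the manual business processing office.
    return "manual"


# Stub comparison for illustration: only the byte string b"user-a" matches.
is_user_a = lambda img: img == b"user-a"
print(verify_client([b"blurry"], lambda: [b"user-a"], is_user_a))
```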
  • the service processing server may also perform secondary verification of the legitimacy of the client according to the first multimedia data; for example, it may obtain, from the first multimedia data, a third image of the user answering a question. For example, if there are 3 questions in the synthesized data, the images of the user answering the 3 questions are intercepted from the first multimedia data to obtain third images containing the user's facial image.
  • micro-expression recognition is performed on the third image, and the authenticity of the user's answer is determined according to the user's micro-expression when answering the question. If micro-expression recognition determines that the authenticity of the user's answer is high, the business is processed.
  • if the authenticity is determined to be low, instruction information for re-verifying the user's identity is sent, or the question for which the user's micro-expression was abnormal is output again. If the secondary verification passes, or the user's facial expression when answering the question again indicates that the authenticity of the answer is high, the business is processed. If the secondary verification fails, or the user's facial expression when answering the question again indicates that the authenticity of the answer is low, information instructing the user to handle the business at the manual business processing office corresponding to the service processing server is output, and the service processing ends.
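  • the decision logic of the secondary verification can be sketched as follows, covering only the branch in which abnormal questions are asked again. The per-question authenticity score, the `threshold` value, and the function names are hypothetical: the source says only that authenticity is judged "high" or "low" from micro-expressions, without giving a scoring method.

```python
from typing import Callable, Dict


def secondary_verification(
    first_scores: Dict[str, float],
    reask: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off between "low" and "high" authenticity
) -> str:
    """Return 'process' or 'manual' from per-question authenticity scores."""
    # Questions whose micro-expression score suggests a doubtful answer.
    abnormal = [q for q, score in first_scores.items() if score < threshold]
    if not abnormal:
        return "process"
    # Output each abnormal question once more and score the new answer.
    if all(reask(q) >= threshold for q in abnormal):
        return "process"
    # Authenticity is still low: refer the user to the manual office.
    return "manual"


# Question q2 looked abnormal at first but passes when asked again.
print(secondary_verification({"q1": 0.9, "q2": 0.2}, lambda q: 0.7))
```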
  • micro-expression recognition can thus be used to verify the authenticity of what the user says, which improves the accuracy of data recognition.
  • by performing micro-expression recognition on the third image in the first multimedia data, the service processing server can identify the authenticity of the user's answers, thereby realizing secondary verification of the user's identity information and improving the accuracy of business processing.
  • verifying the legitimacy of the client amounts to verifying the user's identity information. If the client is verified as legitimate, the user's identity information is true and the corresponding business processing is performed. If the client is verified as not legitimate, the user is prompted to adjust posture by outputting adjustment information so that the legitimacy of the client can be verified again, which improves the reliability of user identity verification and thereby the accuracy of business processing.
  • FIG. 4 is a schematic diagram of the composition structure of a data processing device provided by an embodiment of the present application.
  • the above data processing device may be a computer program (including program code) running in a computer device; for example, the data processing device is application software. The device can be used to execute the corresponding steps in the method provided in the embodiments of this application.
  • the device 40 includes:
  • the data acquisition module 401 is configured to receive a service handling request sent by the client, where the service handling request includes the first voice data and the first video data;
  • the voice matching module 402 is configured to obtain first semantic information from the first voice data, and determine second voice data matching the first semantic information according to the first semantic information; the second voice data is reply data to the first voice data, and the first semantic information is used to reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords;
  • the video matching module 403 is configured to determine second video data matching the second voice data from a video database, and the video database includes multiple video data;
  • the service processing module 404 is configured to synthesize the second video data and the second voice data to obtain synthesized data, send the synthesized data to the client, and perform service processing according to the first multimedia data returned by the client, where the first multimedia data is the reply data of the client to the synthesized data.
  • the voice matching module 402 is specifically used for:
  • the semantic information corresponding to the first text data is determined according to the at least two keywords and the association relationship between them, and is taken as the first semantic information.
  • the voice matching module 402 is specifically used for:
  • the second text data is converted, and the voice data corresponding to the second text data is obtained as the second voice data.
  • the video matching module 403 is specifically used for:
  • the video data corresponding to the target application scene is determined as the second video data.
  • the business processing module 404 is specifically used for:
  • Intercepting video data equal to the voice duration from the second video data as candidate video data
  • the device 40 further includes: a legality verification module 405, configured to:
  • the step of performing service processing according to the first multimedia data returned by the client is executed.
  • the device 40 further includes: an information adjustment module 406, configured to:
  • FIG. 5 is a schematic diagram of the composition structure of a computer device provided by an embodiment of the present application.
  • the foregoing computer device 50 may include: a processor 501, a network interface 504, and a memory 505.
  • the foregoing computer device 50 may also include: a user interface 503, and at least one communication bus 502.
  • the communication bus 502 is used to implement connection and communication between these components.
  • the user interface 503 may include a display screen (Display) and a keyboard (Keyboard), and optionally the user interface 503 may also include a standard wired interface and a wireless interface.
  • the network interface 504 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • the memory 505 may be a high-speed RAM memory, or a non-volatile memory, such as at least one disk memory.
  • optionally, the memory 505 may also be at least one storage device located remotely from the aforementioned processor 501.
  • the memory 505, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
  • the network interface 504 can provide network communication functions;
  • the user interface 503 is mainly used to provide an input interface for the user; and
  • the processor 501 can be used to call the device control application program stored in the memory 505 to implement:
  • the second voice data is reply data to the first voice data
  • the first semantic information is used to reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords
  • the first multimedia data is the reply data of the client to the synthesized data.
  • the computer device 50 described in the embodiment of the present application can perform the data processing method described in the foregoing embodiments corresponding to FIG. 2 and FIG. 3, and can also implement the data processing device described in the foregoing embodiment corresponding to FIG. 4, which will not be repeated here.
  • the description of the beneficial effects of using the same method will not be repeated.
  • the embodiment of the present application also provides a computer-readable storage medium that stores a computer program; the computer program includes program instructions that, when executed by a computer, cause the computer to execute the method described in the foregoing embodiments. The computer can be a part of the aforementioned computer device, for example the aforementioned processor 501.
  • the program instructions may be deployed and executed on one computer device, or on multiple computer devices located at one location, or on multiple computer devices that are distributed across multiple locations and interconnected by a communication network; multiple computer devices distributed across multiple locations and interconnected through a communication network can form a blockchain network.
  • the medium involved in this application such as a computer-readable storage medium, may be non-volatile or volatile.
  • the program can be stored in a computer-readable storage medium and, when executed, may include the procedures of the foregoing method embodiments.
  • the storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Disclosed in embodiments of the present application are a data processing method, an apparatus, a device, and a medium, relating to voice processing technology in artificial intelligence, and applicable to a blockchain network. The method comprises: receiving a service handling request sent by a client, the request including first voice data and first video data; obtaining first semantic information from the first voice data and determining, on the basis of the first semantic information, second voice data matching said first semantic information; determining, from a video database, second video data matching the second voice data; combining the second video data and the second voice data to obtain synthesized data and sending the synthesized data to the client; and performing service processing according to first multimedia data returned by the client. The first multimedia data is reply data returned by the client in respect of the synthesized data. Using embodiments of the present application improves accuracy of voice recognition and enhances user experience.

Description

Data processing method, apparatus, device, and medium
This application claims priority to Chinese patent application No. 202011006333.7, filed with the Chinese Patent Office on September 22, 2020 and entitled "Data processing method, apparatus, device, and medium", the entire content of which is incorporated herein by reference.
Technical field
This application relates to voice processing technology in artificial intelligence, and in particular to a data processing method, apparatus, device, and medium.
Background
Traditional business handling is generally done manually. Manual handling cannot provide round-the-clock service and is costly. Therefore, handling business through video robot calls is gradually replacing manual handling and allows business to be handled anytime and anywhere.
The inventor realized that current video robot calls are generally rather rigid: the video robot can only recognize fixed scripts and reply accordingly, so the recognized semantic information is not comprehensive and the voice recognition accuracy is low. As a result, the accuracy of video robot calls is low and the user experience is poor.
Summary of the invention
The embodiments of the present application provide a data processing method, apparatus, device, and medium, which can improve the accuracy of voice recognition, thereby improving the accuracy of video robot calls and the user experience.
One aspect of the embodiments of the present application provides a data processing method, including:
receiving a service processing request sent by a client, where the service processing request includes first voice data and first video data;
obtaining first semantic information from the first voice data, and determining, according to the first semantic information, second voice data that matches the first semantic information; the second voice data is reply data to the first voice data, and the first semantic information is used to reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords;
determining, from a video database, second video data that matches the second voice data, where the video database includes multiple video data;
performing synthesis processing on the second video data and the second voice data to obtain synthesized data, sending the synthesized data to the client, and performing service processing according to first multimedia data returned by the client, where the first multimedia data is the reply data of the client to the synthesized data.
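As a rough illustration only, the order of the method steps up to producing the synthesized data can be sketched as a small pipeline. The types `ServiceRequest` and `SynthesizedData`, the function `handle_request`, and the three injected callables are hypothetical names; the application does not prescribe any particular speech-understanding, reply-matching, or video-matching implementation, so each is passed in as a placeholder.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ServiceRequest:
    first_voice: bytes  # the user's recorded speech
    first_video: bytes  # the user's recorded video


@dataclass
class SynthesizedData:
    second_voice: bytes  # reply speech
    second_video: bytes  # silent clip matched to the reply


def handle_request(
    request: ServiceRequest,
    extract_semantics: Callable[[bytes], dict],
    match_reply_voice: Callable[[dict], bytes],
    match_video: Callable[[bytes], bytes],
) -> SynthesizedData:
    """Run the steps in order: semantics, reply voice, matching video, synthesis."""
    # First semantic information: keywords plus their association relationship.
    semantics = extract_semantics(request.first_voice)
    # Second voice data: the reply that matches the semantic information.
    second_voice = match_reply_voice(semantics)
    # Second video data: a clip from the video database matching the reply.
    second_video = match_video(second_voice)
    # Placeholder for synthesis: the two parts are simply bundled together.
    return SynthesizedData(second_voice, second_video)
```

Passing the three stages in as callables keeps the sketch independent of any specific recognition model or database, which the application leaves open.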
One aspect of the embodiments of the present application provides a data processing device, including:
a data acquisition module, configured to receive a service processing request sent by a client, where the service processing request includes first voice data and first video data;
a voice matching module, configured to obtain first semantic information from the first voice data, and determine, according to the first semantic information, second voice data that matches the first semantic information; the second voice data is reply data to the first voice data, and the first semantic information is used to reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords;
a video matching module, configured to determine, from a video database, second video data that matches the second voice data, where the video database includes multiple video data;
a service processing module, configured to perform synthesis processing on the second video data and the second voice data to obtain synthesized data, send the synthesized data to the client, and perform service processing according to first multimedia data returned by the client, where the first multimedia data is the reply data of the client to the synthesized data.
One aspect of this application provides a computer device, including: a processor, a memory, and a network interface;
the processor is connected to the memory and the network interface, where the network interface is used to provide data communication functions, the memory is used to store a computer program, and the processor is used to call the computer program to execute the following method:
receiving a service processing request sent by a client, where the service processing request includes first voice data and first video data;
obtaining first semantic information from the first voice data, and determining, according to the first semantic information, second voice data that matches the first semantic information; the second voice data is reply data to the first voice data, and the first semantic information is used to reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords;
determining, from a video database, second video data that matches the second voice data, where the video database includes multiple video data;
performing synthesis processing on the second video data and the second voice data to obtain synthesized data, sending the synthesized data to the client, and performing service processing according to first multimedia data returned by the client, where the first multimedia data is the reply data of the client to the synthesized data.
本申请实施例一方面提供了一种计算机可读存储介质,该计算机可读存储介质存储有计算机程序,该计算机程序包括程序指令,该程序指令当被处理器执行时使该处理器执行以下方法:One aspect of the embodiments of the present application provides a computer-readable storage medium, the computer-readable storage medium stores a computer program, and the computer program includes program instructions that, when executed by a processor, cause the processor to perform the following method :
接收客户端所发送的业务办理请求,该业务办理请求包括第一语音数据和第一视频数据;Receiving a service handling request sent by the client, where the service handling request includes the first voice data and the first video data;
从该第一语音数据中获取第一语义信息,根据该第一语义信息确定与该第一语义信息匹配的第二语音数据;该第二语音数据是针对该第一语音数据的回复数据,该第一语义信息用于反映该第一语音数据对应的第一文本数据中的至少两个关键词,以及至少两个关键词之间的关联关系;Acquire first semantic information from the first voice data, and determine second voice data matching the first semantic information according to the first semantic information; the second voice data is reply data to the first voice data, the The first semantic information is used to reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords;
从视频数据库中确定与该第二语音数据匹配的第二视频数据,该视频数据库中包括多个视频数据;Determining second video data matching the second voice data from a video database, where the video database includes multiple video data;
对该第二视频数据与该第二语音数据进行合成处理,得到合成数据,将该合成数据发送至该客户端,根据该客户端回复的第一多媒体数据进行业务处理,该第一多媒体数据是该客户端针对该合成数据的回复数据。Perform synthesis processing on the second video data and the second voice data to obtain synthesized data, send the synthesized data to the client, and perform service processing according to the first multimedia data replies from the client. The media data is the reply data of the client for the composite data.
The embodiments of the present application improve the accuracy of speech recognition, which in turn improves the accuracy of video robot calls and makes human-computer interaction more natural and more engaging, thereby improving the user experience.
Description of the Drawings
FIG. 1 is a schematic architecture diagram of an information processing system provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a data processing method provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of a data processing method provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the drawings in the embodiments of the present application.
The technical solution of the present application can be applied to the fields of artificial intelligence, smart cities, blockchain, and/or big data. Optionally, the data involved in the present application, such as the synthesized data and/or the multimedia data, may be stored in a database or in a blockchain, which is not limited in the present application.
Artificial intelligence is a comprehensive discipline that covers a wide range of fields and includes both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The key technologies of speech processing (Speech Technology) are automatic speech recognition (ASR), speech synthesis (Text To Speech, TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the direction in which human-computer interaction is developing, and voice is regarded as one of the most promising modes of human-computer interaction.
The present application relates to speech processing technology in artificial intelligence, and its technical solution is suitable for remote service handling scenarios such as remote interview review, video follow-up visits, and remote account opening. Speech processing technology is used to acquire first semantic information from the first voice data; second voice data matching the first semantic information is determined according to the first semantic information; second video data matching the second voice data is determined from the video database; the second video data and the second voice data are synthesized to obtain synthesized data; the synthesized data is sent to the client; and service processing is performed according to the first multimedia data returned by the client. Because the first semantic information is obtained by combining the keywords in the text data corresponding to the first voice data with the relationships between those keywords, it reflects the meaning of the first voice data more accurately. In other words, speech recognition is more accurate, and the second voice data returned in reply to the first voice data is more accurate as well. The present application is applicable to fields such as smart government affairs and smart education, and helps promote the construction of smart cities.
Refer to FIG. 1, which is a schematic architecture diagram of an information processing system provided by an embodiment of the present application. The system includes a client 11 and a service processing server 12 corresponding to a service processing platform. The client 11 may be a terminal that sends a service handling request. The service processing server 12 may be a back-end service device that performs information processing, where the information processing may include acquiring the first semantic information from the first voice data, determining second voice data matching the first semantic information according to the first semantic information, determining second video data matching the second voice data from the video database, synthesizing the second video data with the second voice data to obtain synthesized data, and so on.
The service processing server 12 may be a single independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), and big data and artificial intelligence platforms. When the service processing server 12 is a single physical server, that server can perform the information processing on its own; when the service processing server 12 consists of multiple physical servers, the servers can cooperate to perform the information processing. For example, one server may acquire the first semantic information from the first voice data, another server may determine the second voice data matching the first semantic information according to the first semantic information, yet another server may synthesize the second video data with the second voice data to obtain the synthesized data, and so on.
The client 11 may be a computer device, including a mobile phone, a tablet computer, a notebook computer, a handheld computer, a smart speaker, a mobile internet device (MID), a point-of-sale (POS) machine, a wearable device (such as a smart watch or a smart bracelet), and the like. There may be one or more clients 11. The embodiments of the present application are described with one client; multiple clients can be handled in the same way. It should be noted that the service processing may also be performed by the client, in which case the client performs it in the same manner as the service processing server. The following description takes service processing by the service processing server as an example.
In practice, suppose a user needs to handle a certain service. First, the user sends a service handling request to the service processing server through the client; the service handling request may include first voice data and first video data. Next, the service processing server receives the service handling request and acquires first semantic information from the first voice data. The first semantic information may contain a service identifier, such as a service name or a service code. The first semantic information also reflects at least two keywords in the first text data corresponding to the first voice data and the relationship between the at least two keywords; that is, the first semantic information is determined from the at least two keywords and the relationship between them. The service processing server then determines the second voice data matching the first semantic information and determines, from the video database, the second video data matching the second voice data. For example, if the second voice data instructs the user on the procedure for handling a certain service, the second video data may be silent video data in which a video robot imitates a person speaking, with the mouth opening and closing, the eyes moving, and a smile on the face.
Alternatively, if the second voice data informs the user that the current network connection has failed, the second video data may be silent video data in which the video robot imitates a frustrated or disappointed person. The service processing server synthesizes the silent second video data with the second voice data to obtain synthesized data and sends the synthesized data to the client, so that the user sees the video robot talking with them through the client. Synthesizing video data with voice data in this way enables a real-time video call with the user of the client and improves the user experience. Because every keyword in the text data corresponding to the first voice data is extracted and the association relationships between the keywords are obtained when acquiring the first semantic information, the meaning of the first voice data is determined more accurately. In other words, speech recognition is more accurate, the second voice data returned in reply is more accurate, and the video robot call becomes more accurate and more flexible.
Refer to FIG. 2, which is a schematic flowchart of a data processing method provided by an embodiment of the present application. As shown in FIG. 2, the method includes:
S101: Receive a service handling request sent by a client.
The service handling request includes first voice data and first video data. The first voice data is obtained by the client recording the user's speech, and the first video data is obtained by the client recording video of the user's posture, expressions, movements, and the like. The client may be a terminal that the user uses for service processing. The first voice data may contain a service identifier, which may be a service name, an abbreviation of the service name, a service code, or another identifier that uniquely indicates the service. A service is a type of business that the user needs to handle, such as purchasing property insurance, taking out a bank loan, or applying for a bank card or a credit card. A service may also be a function the user needs, such as a bank card balance inquiry or a credit card limit inquiry.
Optionally, the user may send a call request through the client. The service processing server obtains the call request, establishes a call connection with the client according to the call request, and receives the service handling request sent by the client over the call connection.
Here, the service processing server may correspond to multiple video extensions, all of which have the same function. The call request may include a video extension number, in which case the video extension corresponding to that number is assigned to the client, thereby establishing the call connection with the client. If the call request does not include a video extension number, an idle video extension may be assigned to the client. Specifically, a video extension can be matched to the call request according to the waiting time of each idle video extension: for example, the idle video extension that has waited the longest may be matched to the client, or the idle video extension that has waited the shortest may be matched to the client. Assigning an idle video extension to the client makes it faster to establish the call connection and therefore improves the efficiency of subsequent service processing. The call connection may be a video connection, a voice connection, and so on. A video connection is used to obtain the video data sent by the client, and a voice connection is used to obtain the voice data sent by the client. If the call connection is a voice connection, the client may send the first video data to the service processing server in another way, for example by video transmission.
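The extension assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the present application: the `VideoExtension` record, its fields, and the `assign_extension` function are hypothetical names, and the sketch assumes that each extension tracks how long it has been idle.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VideoExtension:
    number: str          # extension number (hypothetical format)
    idle: bool           # True if the extension is not currently in a call
    idle_seconds: float  # how long the extension has been waiting


def assign_extension(extensions: List[VideoExtension],
                     requested_number: Optional[str] = None,
                     prefer_longest_wait: bool = True) -> Optional[VideoExtension]:
    """Pick a video extension for an incoming call request."""
    if requested_number is not None:
        # The call request names an extension: return that extension,
        # or None if no extension has that number.
        for ext in extensions:
            if ext.number == requested_number:
                return ext
        return None

    # No extension number in the request: choose among idle extensions
    # by waiting time (longest or shortest, depending on the policy).
    idle = [ext for ext in extensions if ext.idle]
    if not idle:
        return None
    pick = max if prefer_longest_wait else min
    return pick(idle, key=lambda ext: ext.idle_seconds)
```

Choosing the longest-waiting extension spreads calls evenly across extensions, while choosing the shortest-waiting one keeps recently used extensions busy; the text allows either policy, so the sketch exposes it as a flag.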
In a specific implementation, when establishing a call connection with the service processing server, the client may send a call request to the service processing server. The call request may be verified by a network switch, a firewall, or the like, for example to check whether the call request is secure, whether it carries a virus, and whether it is in a format that the service processing server can recognize. The call connection with the service processing server is established after the verification passes.
Optionally, after the call connection between the client and the service processing server is established, the service processing server may send pre-stored welcome voice data to the client. From this voice data the user can tell that the call connection has been established and can then send the service handling request. For example, the pre-stored welcome voice data may be "Hello, thank you for calling. How may I help you?". Optionally, to save storage space, the welcome message may be stored on the service processing server as text data; the service processing server converts the text data into voice data and then sends the voice data to the client. Because text data occupies less storage space than voice data, storing the message as text saves storage space.
S102: Acquire first semantic information from the first voice data, and determine, according to the first semantic information, second voice data matching the first semantic information.
The second voice data is reply data to the first voice data, and the first semantic information reflects at least two keywords in the first text data corresponding to the first voice data and the association relationship between the at least two keywords. Compared with determining the meaning of the first voice data from the keywords alone, combining the keywords with the association relationships between them determines the meaning of the first voice data more accurately. The resulting first semantic information is more accurate, that is, speech recognition is more accurate, and so the matched second voice data is more accurate.
Optionally, one method of acquiring the first semantic information from the first voice data is: converting the first voice data to obtain the first text data corresponding to the first voice data; performing keyword extraction on the first text data to obtain at least two keywords; obtaining word meaning information of the at least two keywords; obtaining a word combination from the at least two keywords and obtaining combined word meaning information of the word combination; determining the association relationship between the at least two keywords according to the word meaning information and the combined word meaning information; and determining the semantic information corresponding to the first text data according to the at least two keywords and the association relationship between them, as the first semantic information.
Here, because the first voice data is voice-type data, it can be converted into text-type data to obtain the first text data, and at least two keywords are obtained by performing keyword extraction on the first text data. Optionally, the service processing platform may perform word segmentation on the first text data to divide it into at least one segment; obtain a stop-word set that includes at least one word unrelated to the service; search the stop-word set for target words that match the at least one segment; delete the target words from the at least one segment; and perform keyword extraction on the segments that remain after the target words are deleted, to obtain at least two keywords.
For example, suppose the first text data is "I want to buy auto insurance, but my funds are insufficient and I need a loan first" (我想购买车险,但资金不足需要先贷款). Word segmentation divides it into 10 segments: "I", "want", "buy", "auto insurance", "but", "funds", "insufficient", "need", "first", and "loan". Each of these 10 segments is then matched against the stop words in the stop-word set. If the four segments "I", "want", "but", and "need" are matched, they are deleted, leaving "buy auto insurance funds insufficient first loan". Keyword extraction on the remaining text yields 6 keywords: "buy", "auto insurance", "funds", "insufficient", "first", and "loan". In a specific implementation, ASR or another technology may be used to convert the voice data into text data, from which the keywords are then extracted.
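The stop-word filtering in this example can be sketched as follows. This is a minimal illustration that assumes the text has already been segmented; the `STOP_WORDS` set and the `extract_keywords` function are hypothetical names, English glosses stand in for the segments, and the sketch simply treats every segment that survives the filter as a keyword.

```python
# Hypothetical stop-word set: words unrelated to the service.
STOP_WORDS = {"I", "want", "but", "need"}


def extract_keywords(segments, stop_words=STOP_WORDS):
    """Delete segments found in the stop-word set; keep the rest, in order."""
    return [seg for seg in segments if seg not in stop_words]


# The 10 segments from the example sentence (English glosses).
segments = ["I", "want", "buy", "auto insurance", "but",
            "funds", "insufficient", "need", "first", "loan"]
keywords = extract_keywords(segments)
print(keywords)
```

A real system would use a word segmenter for the source language and might rank the surviving segments rather than keep all of them; neither choice is specified in the text.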
After keyword extraction on the first text data yields the keywords, the word meaning information of these 6 keywords is obtained. A word combination is formed from the 6 keywords, for example "buy auto insurance funds insufficient first loan", and the combined word meaning information of the word combination is obtained. The association relationship between the keywords is determined according to the word meaning information of the 6 keywords and the combined word meaning information, which shows that the first text data means "take out a loan first, then buy auto insurance". The semantic information corresponding to the first text data is then determined according to the keywords and the association relationship between them and is used as the first semantic information, so the first semantic information is "loan, buy auto insurance".
Optionally, another method of acquiring the first semantic information from the first voice data is to perform recognition directly on the first voice data to obtain at least two keywords in it. For example, if the first voice data is "I want to buy auto insurance, but my funds are insufficient and I need a loan first", the recognized keywords include "buy", "auto insurance", "funds", "insufficient", "first", and "loan", and the first semantic information obtained is "loan, buy auto insurance". Which method to use can be chosen according to specific requirements: converting the first voice data into text data for keyword extraction, or performing speech recognition on the first voice data to obtain the keywords directly. For example, if speech recognition is cheaper, it can be used when saving cost is the priority; if converting the voice data into text data for keyword extraction is more accurate, that approach can be used when recognition accuracy is the priority.
Optionally, the method of determining the second voice data matching the first semantic information according to the first semantic information may include the following steps:
1. Obtain the similarity between the at least two keywords and multiple pieces of text data in a corpus, to obtain multiple first similarities.
Here, the corpus is a database corresponding to the service processing server. The corpus may contain text data related to service handling, such as the specific procedure for handling a service, and may also contain text data unrelated to service handling. In a specific implementation, a similarity calculation method can be used to calculate the similarity between each keyword and each piece of text data in the corpus, yielding multiple first similarities. The similarity calculation method may be the Pearson correlation coefficient, cosine similarity, or another method, which is not limited here.
2. Determine the text data corresponding to the association relationship between the at least two keywords, and obtain the similarity between that text data and the multiple pieces of text data, to obtain multiple second similarities.
Here, the association relationship between the at least two keywords is the connection between them; for example, it may indicate the order of the services corresponding to the keywords. The text data corresponding to the association relationship can therefore be determined from this connection. For example, suppose the keywords include "buy", "auto insurance", "funds", "insufficient", "first", and "loan". If the connection between the keywords is ignored, the service-related text data obtained is "buy auto insurance, loan". If the connection between the keywords is considered, the text data corresponding to the association relationship is "loan, then buy auto insurance", that is, the loan service is handled first and the auto insurance purchase is handled afterwards. In a specific implementation, a similarity calculation method can be used to calculate the similarity between the text data corresponding to the association relationship and the multiple pieces of text data in the corpus, yielding multiple second similarities. The similarity calculation method may be the Pearson correlation coefficient, cosine similarity, or another method, which is not limited here.
3. Determine the text data corresponding to the largest of the multiple first similarities as first target text data, and determine the text data corresponding to the largest of the multiple second similarities as second target text data.
Here, because the corpus contains multiple pieces of text data, calculating the similarity between one keyword and one piece of text data in the corpus yields one first similarity, so the at least two keywords and the multiple pieces of text data together yield multiple first similarities. For example, if the number of keywords is n1 and the number of pieces of text data in the corpus is m1, then n1*m1 first similarities can be calculated. Correspondingly, if the number of pieces of text data corresponding to the association relationship is n2 and the number of pieces of text data in the corpus is m1, then n2*m1 second similarities can be calculated. The n1*m1 first similarities are compared to find the largest among them, and the n2*m1 second similarities are compared to find the largest among them. The text data corresponding to the largest first similarity is determined as the first target text data, and the text data corresponding to the largest second similarity is determined as the second target text data.
4. Determine second text data according to the first target text data and the second target text data.
Here, if the first target text data is the same as the second target text data, the first target text data may be determined as the second text data; if the first target text data differs from the second target text data, the second target text data may be determined as the second text data.
5. Convert the second text data to obtain the voice data corresponding to the second text data, as the second voice data.
In a specific implementation, natural language processing (Natural Language Processing, NLP) may be used to process the text data, acquire the first semantic information, and determine the second voice data matching the first semantic information. TTS or another technology may be used to convert the text data into voice data. Once the second text data has been converted into the second voice data, the user can obtain the second voice data through the client and reply with the first multimedia data, so that the service processing platform can handle the corresponding service according to the first multimedia data. Voice data lets the user grasp the content more directly than reading the second text data would, which makes it faster for the user to take in the reply and therefore improves service processing efficiency.
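Steps 1 to 4 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the present application: it uses cosine similarity over whitespace-separated word counts (one of the measures the text permits), and the function names `cosine`, `best_match`, and `select_reply_text` are hypothetical. Note that word-count cosine similarity ignores word order, so a production system would need a representation that captures the ordering expressed by the association relationship.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def best_match(queries, corpus):
    """Compute len(queries) * len(corpus) similarities and return the
    corpus text that achieves the largest one."""
    best_text, best_score = None, -1.0
    for query in queries:
        for text in corpus:
            score = cosine(query, text)
            if score > best_score:
                best_text, best_score = text, score
    return best_text


def select_reply_text(keywords, relation_texts, corpus):
    """Steps 1-4: choose the second text data from the corpus."""
    first_target = best_match(keywords, corpus)         # steps 1 and 3
    second_target = best_match(relation_texts, corpus)  # steps 2 and 3
    # Step 4: if the two targets agree, use that text; otherwise the
    # text matched via the association relationship takes precedence.
    if first_target == second_target:
        return first_target
    return second_target
```

As step 4 is worded, the relation-based match decides the result whenever the two targets disagree; the keyword-based match acts as a cross-check. Step 5 (text-to-speech conversion) is omitted because the text does not name a specific TTS engine.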
S103: Determine, from the video database, second video data matching the second voice data.
Here, the video database includes multiple pieces of video data, which may be of several types. For example, it may include silent video data in which a video robot imitates a person speaking, with the mouth opening and closing, the eyes moving, and a smile on the face; silent video data in which the video robot imitates a frustrated or disappointed person; or silent video data in which the video robot imitates an apologetic expression. In a specific implementation, these types of video data can be stored in the video database in advance for later use.
Optionally, the second video data matching the second voice data may be determined from the video database as follows: obtain the semantic scene of the second text data and the application scene of each piece of video data in the video database; determine, from the application scenes, a target application scene that matches the semantic scene; and determine the video data corresponding to the target application scene as the second video data.
The second text data is the text data corresponding to the second voice data: the two have the same meaning but different forms of expression, the second text data being text and the second voice data being sound. The semantic scene of the second text data refers to the meaning of the second text data; for example, it may be a specific service handling procedure, indication information instructing the user to wait, or a handling-failure prompt, where a handling failure may include a network connection failure, a busy server, and so on. The application scene of a piece of video data may be determined according to the type of the video data. For example, if the semantic scene is a specific service handling procedure, the matching target application scene may be the one associated with the silent video data in which the video robot simulates a person speaking, with the mouth opening and closing, the eyes moving, and a smile on the face. If the semantic scene is indication information instructing the user to wait, the matching target application scene may be the one associated with the silent video data in which the video robot simulates expressions such as guilt. If the semantic scene is a handling-failure prompt, the matching target application scene may be the one associated with the silent video data in which the video robot simulates frustration or disappointment. Determining the target application scene that matches the semantic scene, and taking the corresponding video data as the second video data, makes the synthesized data that the user later sees more natural and makes the human-computer interaction more engaging.
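The scene-matching step described above can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the scene labels, the `VideoClip` type, the `SCENE_TO_APPLICATION` table, and the function name `select_second_video` are all hypothetical names introduced here, and the lookup table stands in for whatever matching rule an actual system would use.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class VideoClip:
    """A silent clip stored in the video database (hypothetical structure)."""
    clip_id: str
    application_scene: str  # hypothetical label derived from the clip type


# Hypothetical mapping from the semantic scene of the second text data
# to the application scene of a clip, following the three examples in the text.
SCENE_TO_APPLICATION = {
    "service_procedure": "speaking",   # mouth moving, eyes moving, smiling
    "please_wait": "apologetic",       # expressions such as guilt
    "failure_notice": "frustrated",    # frustration, disappointment
}


def select_second_video(semantic_scene: str,
                        video_database: Sequence[VideoClip]) -> Optional[VideoClip]:
    """Return the first clip whose application scene matches the semantic scene."""
    target_scene = SCENE_TO_APPLICATION.get(semantic_scene)
    if target_scene is None:
        return None  # no known application scene for this semantic scene
    for clip in video_database:
        if clip.application_scene == target_scene:
            return clip
    return None


VIDEO_DATABASE = [
    VideoClip("clip_speaking", "speaking"),
    VideoClip("clip_apologetic", "apologetic"),
    VideoClip("clip_frustrated", "frustrated"),
]
```

With this table, a "please wait" reply selects the apologetic clip, and a semantic scene that has no entry yields no clip.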
S104: Synthesize the second video data and the second voice data to obtain synthesized data, send the synthesized data to the client, and perform service processing according to the first multimedia data returned by the client.
The first multimedia data is the client's reply to the synthesized data. The service processing server sends the synthesized data to the client. After learning the content of the synthesized data through the client, the user responds accordingly, for example by answering the questions in the synthesized data, filling in identity information according to the prompts in the synthesized data, or uploading the corresponding identity documents. The client collects the user's response to the synthesized data, for example by recording the user's spoken reply and the user's actions and expressions, to obtain the first multimedia data. The client sends the first multimedia data to the service processing server, and the service processing server performs service processing according to the first multimedia data.
Optionally, the second video data and the second voice data may be synthesized as follows: obtain the voice duration of the second voice data; extract from the second video data a segment whose length equals the voice duration, as candidate video data; and synthesize the second voice data with the candidate video data to obtain the synthesized data.
Here, if the video duration of the second video data equals the voice duration of the second voice data, the second video data is used as the candidate video data. If the video duration of the second video data is greater than the voice duration of the second voice data, a segment equal in length to the voice duration may be extracted from the second video data as the candidate video data. For example, if the second video data lasts 5 seconds and the second voice data lasts 3 seconds, 3 seconds of video may be taken from the 5-second second video data as the candidate video data. If the video duration of the second video data is less than the voice duration of the second voice data, several copies of the second video data may be concatenated to obtain video data equal in length to the voice duration, as the candidate video data. For example, if the second video data lasts 3 seconds and the second voice data lasts 6 seconds, the same 3-second second video data may be concatenated with itself to obtain 6 seconds of video as the candidate video data. The second voice data and the candidate video data may then be synthesized into video data with sound, which serves as the synthesized data; the sound of the synthesized data is the second voice data. Alternatively, the second voice data and the candidate video data may be sent to the client together and played by the client simultaneously, in which case the synthesized data consists of the second voice data and the candidate video data. Thus, whenever the service processing server sends voice data to the client, it determines from the video database the video data matching that voice data, obtains synthesized data from the voice data and the video data, and sends it to the client, forming a dialogue with the client that makes the human-computer interaction more natural.
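The three duration cases above (equal, longer, shorter) reduce to one rule: repeat the clip from its start until the voice duration is covered. The sketch below plans the segments only and performs no actual video editing. The function name `build_candidate_segments` and the use of integer milliseconds are assumptions made for illustration; so is the handling of a voice duration that is not an exact multiple of the clip length (the last repetition is truncated), a case the text does not address.

```python
from typing import List, Tuple


def build_candidate_segments(video_ms: int, voice_ms: int) -> List[Tuple[int, int]]:
    """Plan (start_ms, end_ms) segments of the second video data whose total
    length equals the voice duration.

    - video longer than voice: one truncated segment
    - video equal to voice:    the whole clip
    - video shorter than voice: the clip repeated, the last copy truncated
      if needed (the truncation is an assumption, not stated in the text)
    """
    if video_ms <= 0 or voice_ms <= 0:
        raise ValueError("durations must be positive")
    segments: List[Tuple[int, int]] = []
    remaining = voice_ms
    while remaining > 0:
        take = min(video_ms, remaining)  # never take more than is still needed
        segments.append((0, take))
        remaining -= take
    return segments


def total_length(segments: List[Tuple[int, int]]) -> int:
    """Sum of segment lengths in milliseconds."""
    return sum(end - start for start, end in segments)
```

For the examples in the text, a 5-second clip with 3 seconds of voice yields one 3-second segment, and a 3-second clip with 6 seconds of voice yields the clip twice. The resulting plan could then be handed to whatever video tool performs the actual cut and concatenation.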
In this embodiment of the application, a service handling request sent by the client is received, the request including first voice data and first video data; first semantic information is obtained from the first voice data, and second voice data matching the first semantic information is determined according to it. Because the first semantic information reflects at least two keywords in the first text data corresponding to the first voice data, as well as the association relationship between those keywords, it represents the meaning of the first voice data more accurately; that is, the accuracy of speech recognition is higher, so the matched second voice data is more accurate and the accuracy of the video robot's conversation can be improved. Second video data matching the second voice data is determined from the video database, the second video data and the second voice data are synthesized to obtain synthesized data, the synthesized data is sent to the client, and service processing is performed according to the first multimedia data returned by the client. Processing and replying to the user's voice data in real time can improve the efficiency of service processing. Because the second voice data and the second video data are synthesized before being sent to the client, the user sees, through the client, a video robot holding a video call with them, which makes the human-computer interaction more natural and more engaging and improves the user experience.
Optionally, refer to FIG. 3, which is a schematic flowchart of a data processing method provided by an embodiment of this application. As shown in FIG. 3, the method includes:
S201: Receive a service handling request sent by the client.
S202: Obtain first semantic information from the first voice data, and determine, according to the first semantic information, second voice data matching the first semantic information.
S203: Determine, from the video database, second video data matching the second voice data.
S204: Synthesize the second video data and the second voice data to obtain synthesized data, and send the synthesized data to the client.
Here, for the specific implementation of steps S201 to S204, refer to the description of steps S101 to S104 in the embodiment corresponding to FIG. 2; details are not repeated here.
S205: Capture, from the first video data, a first image of the user corresponding to the client.
Here, the first video data is obtained by the client recording video of the user's posture, expressions, actions, and the like, so the first video data includes the user's facial image. The service processing server may capture frames from the first video data at a preset interval to obtain first images containing the user's face, that is, first images of the user corresponding to the client. For example, a frame may be captured from the first video data every 0.5 seconds to obtain a first image; if the first video data lasts 2 seconds, 4 first images of the user are obtained.
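The sampling arithmetic above can be checked with a short sketch. The function name `sample_timestamps_ms` is hypothetical, and the choice to sample at 0, 0.5, 1.0 and 1.5 seconds (starting at the first frame and excluding the end of the video) is an assumption; the text only fixes the interval and the resulting count.

```python
from typing import List


def sample_timestamps_ms(duration_ms: int, interval_ms: int = 500) -> List[int]:
    """Return the timestamps (in milliseconds) at which frames are captured.

    Sampling starts at 0 and stops before the end of the video; this
    alignment is an assumption made for illustration.
    """
    if duration_ms < 0:
        raise ValueError("duration must not be negative")
    if interval_ms <= 0:
        raise ValueError("interval must be positive")
    return list(range(0, duration_ms, interval_ms))


# A 2-second video sampled every 0.5 seconds gives four capture points.
timestamps = sample_timestamps_ms(2000)
```

Each timestamp would then be passed to a frame grabber to produce one first image.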
S206: Verify the legitimacy of the client according to the first image.
Here, verifying the legitimacy of the client according to the first image may mean verifying whether the user's facial image in the first image and the user image stored by the service processing server are facial images of the same user. If so, the client is determined to be legitimate. If not, the client is determined not to be legitimate, and the server sends adjustment information instructing the user to adjust their posture to the client; obtains second multimedia data sent by the client in response to the adjustment information; captures a second image of the user from the third video data; and verifies the legitimacy of the client according to the second image.
The second multimedia data includes the third video data. The user's facial information stored by the service processing server may be facial information stored when the user previously handled a service at the service processing server. For example, if the user once applied for a bank card through the service processing server, the stored facial information may be the facial information the user provided when applying for that bank card. If the user has not previously handled a service at the service processing server, or no facial information was stored when the user did so, the user's facial information may be obtained from other servers that store it, for example servers of institutions such as the Ministry of Public Security or the Ministry of Civil Affairs. When the service processing server finds that the user's facial image in the first image and the user image it stores are not facial images of the same user, it determines that the client is not legitimate and sends adjustment information instructing the user to adjust their posture to the client, so that the user adjusts their posture accordingly. For example, if the user's face was not facing the client's camera, after adjustment the user's face faces the camera; or, if the camera's view included both user A and user B and user A is the user who needs to handle the service, after adjustment the camera's view includes only user A.
Specifically, the service processing server obtains the second multimedia data sent by the client in response to the adjustment information, the second multimedia data including third video data; captures a second image of the user from the third video data; and verifies the legitimacy of the client according to the second image. The second image includes the user's facial image. If the second image and the user's facial image stored in the service processing server are facial images of the same user, the client is legitimate, and service processing is performed according to the first multimedia data returned by the client. If they are not facial images of the same user, the client is not legitimate; the server outputs indication information instructing the user to handle the service at the manual service counter corresponding to the service processing server, and ends the service processing.
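The verification flow with a single posture-adjustment retry can be summarized in a few lines. This is only a control-flow sketch under stated assumptions: the face comparison is passed in as a callback (`is_same_user`), the round trip to the client is passed in as another callback (`request_adjusted_images`), the returned strings are hypothetical outcome labels, and treating one matching frame as sufficient is an assumption the text does not settle.

```python
from typing import Callable, Sequence, TypeVar

Image = TypeVar("Image")

PROCESS_SERVICE = "process_service"                  # hypothetical outcome label
REFER_TO_MANUAL_COUNTER = "refer_to_manual_counter"  # hypothetical outcome label


def verify_client(first_images: Sequence[Image],
                  request_adjusted_images: Callable[[], Sequence[Image]],
                  is_same_user: Callable[[Image], bool]) -> str:
    """Verify the client with the first images, retrying once after adjustment.

    is_same_user stands in for the comparison against the stored facial image;
    request_adjusted_images stands in for sending the adjustment information
    and capturing second images from the third video data.
    """
    # First check: any captured frame matching the stored face is accepted
    # (assumption: the text does not say how many frames must match).
    if any(is_same_user(image) for image in first_images):
        return PROCESS_SERVICE
    # Not legitimate yet: ask the user to adjust posture and try once more.
    second_images = request_adjusted_images()
    if any(is_same_user(image) for image in second_images):
        return PROCESS_SERVICE
    # Still no match: direct the user to the manual service counter.
    return REFER_TO_MANUAL_COUNTER


# Stub usage: strings play the role of images, "owner" is the stored user.
calls = []
def fake_adjustment():
    calls.append("adjust")
    return ["owner"]

result_direct = verify_client(["owner"], fake_adjustment, lambda img: img == "owner")
result_retry = verify_client(["stranger"], fake_adjustment, lambda img: img == "owner")
result_fail = verify_client(["stranger"], lambda: ["stranger"], lambda img: img == "owner")
```

In the stub run, the adjustment request is issued only for the second call, where the first images fail to match.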
S207: If the client is legitimate, perform service processing according to the first multimedia data returned by the client.
Here, if the client is legitimate, that is, if the second image and the user's facial image stored in the service processing server are facial images of the same user, service processing is performed according to the first multimedia data returned by the client.
In a possible implementation, before performing service processing, the service processing server may also perform a secondary verification of the client's legitimacy according to the first multimedia data. For example, it may obtain from the first multimedia data third images of the user answering questions: if the synthesized data contains three questions, images of the user answering those three questions are captured from the first multimedia data, yielding third images that contain the user's facial image. Micro-expression recognition is performed on the third images, and the truthfulness of the user's answers is determined from the micro-expressions shown while answering. If micro-expression recognition indicates that the answers are likely truthful, the service is processed. If it indicates that the answers are unlikely to be truthful, the server sends indication information for verifying the user's identity a second time, or outputs again the question during which the user's micro-expression was abnormal. If the secondary verification passes, or the user's expression when answering the question again indicates that the answer is likely truthful, the service is processed. If the secondary verification fails, or the user's expression when answering the question again indicates that the answer is unlikely to be truthful, the server outputs indication information instructing the user to handle the service at the manual service counter corresponding to the service processing server, and ends processing of the service. Micro-expression recognition can be used to verify the truthfulness of what the user says, which can improve the accuracy of data recognition. Obtaining the first image from the first video data and verifying it at the service processing server improves confidence in the user's identity, and performing micro-expression recognition on the third images in the first multimedia data lets the server assess the truthfulness of the user's answers, thereby verifying the user's identity information a second time and improving the accuracy of service processing.
In this embodiment of the application, the legitimacy of the client is verified before service processing, that is, the user's identity information is verified. If the client is verified as legitimate, the user's identity information is genuine and the corresponding service processing is performed. If the client is verified as not legitimate, adjustment information is output to prompt the user to adjust their posture so that the client's legitimacy can be verified again. This can improve the reliability of user identity verification and thereby the accuracy of service processing.
The method of the embodiments of this application has been described above; the apparatus of the embodiments of this application is described below.
Refer to FIG. 4, which is a schematic structural diagram of a data processing apparatus provided by an embodiment of this application. The data processing apparatus may be a computer program (including program code) running on a computer device; for example, the data processing apparatus is application software. The apparatus may be used to perform the corresponding steps in the method provided by the embodiments of this application. The apparatus 40 includes:
a data acquisition module 401, configured to receive a service handling request sent by a client, the service handling request including first voice data and first video data;
a voice matching module 402, configured to obtain first semantic information from the first voice data, and determine, according to the first semantic information, second voice data matching the first semantic information, where the second voice data is reply data for the first voice data, and the first semantic information reflects at least two keywords in the first text data corresponding to the first voice data and the association relationship between the at least two keywords;
a video matching module 403, configured to determine, from a video database, second video data matching the second voice data, the video database including a plurality of pieces of video data;
a service processing module 404, configured to synthesize the second video data and the second voice data to obtain synthesized data, send the synthesized data to the client, and perform service processing according to first multimedia data returned by the client, where the first multimedia data is the client's reply data for the synthesized data.
Optionally, the voice matching module 402 is specifically configured to:
convert the first voice data to obtain the first text data corresponding to the first voice data;
perform keyword extraction on the first text data to obtain at least two keywords;
obtain word meaning information of the at least two keywords;
obtain a word combination according to the at least two keywords, and obtain combined word meaning information of the word combination;
determine the association relationship between the at least two keywords according to the word meaning information and the combined word meaning information;
determine, according to the at least two keywords and the association relationship between the at least two keywords, the semantic information corresponding to the first text data, as the first semantic information.
Optionally, the voice matching module 402 is specifically configured to:
obtain the similarities between the at least two keywords and a plurality of pieces of text data in a corpus, to obtain a plurality of first similarities;
determine, according to the association relationship between the at least two keywords, the text data corresponding to the association relationship, and obtain the similarities between that text data and the plurality of pieces of text data, to obtain a plurality of second similarities;
determine the text data corresponding to the largest of the first similarities as first target text data, and determine the text data corresponding to the largest of the second similarities as second target text data;
determine the second text data according to the first target text data and the second target text data;
convert the second text data to obtain the voice data corresponding to the second text data, as the second voice data.
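The "largest similarity" selection in the steps above can be illustrated as follows. The text does not say how similarity is computed, so the token-level Jaccard measure used here is purely an assumption, as are the function names and the toy corpus. How the second text data is then derived from the two targets is also left open by the text, so the sketch stops at returning both targets.

```python
from typing import List, Sequence, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed stand-in for the unspecified measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def best_match(query: str, corpus: Sequence[str]) -> str:
    """Return the corpus text with the largest similarity to the query."""
    if not corpus:
        raise ValueError("corpus must not be empty")
    return max(corpus, key=lambda text: jaccard(query, text))


def select_target_texts(keywords: List[str],
                        relation_text: str,
                        corpus: Sequence[str]) -> Tuple[str, str]:
    """Pick the first target (by keywords) and second target (by relation text)."""
    first_target = best_match(" ".join(keywords), corpus)
    second_target = best_match(relation_text, corpus)
    return first_target, second_target


# Toy corpus and inputs, invented for illustration only.
CORPUS = [
    "apply for a credit card",
    "report a lost bank card",
    "check account balance",
]
targets = select_target_texts(["lost", "card"], "report lost card", CORPUS)
```

In this toy run both selections land on the same corpus entry; when they differ, a real system would need some rule for combining them into the second text data.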
Optionally, the video matching module 403 is specifically configured to:
obtain the semantic scene of the second text data and the application scene of each piece of video data in the video database;
determine, from the application scenes, a target application scene matching the semantic scene;
determine the video data corresponding to the target application scene as the second video data.
Optionally, the service processing module 404 is specifically configured to:
obtain the voice duration of the second voice data;
extract from the second video data a segment equal in length to the voice duration, as candidate video data;
synthesize the second voice data and the candidate video data to obtain the synthesized data.
Optionally, the apparatus 40 further includes a legitimacy verification module 405, configured to:
capture, from the first video data, a first image of the user corresponding to the client;
verify the legitimacy of the client according to the first image;
if the client is legitimate, perform the step of performing service processing according to the first multimedia data returned by the client.
Optionally, the apparatus 40 further includes an information adjustment module 406, configured to:
if the client is not legitimate, send adjustment information instructing the user to adjust their posture to the client;
obtain second multimedia data sent by the client in response to the adjustment information, the second multimedia data including third video data;
capture a second image of the user from the third video data;
verify the legitimacy of the client according to the second image.
It should be noted that, for content not mentioned in the embodiment corresponding to FIG. 4, refer to the description of the method embodiments; details are not repeated here.
In this embodiment of the application, a service handling request sent by the client is received, the request including first voice data and first video data; first semantic information is obtained from the first voice data, and second voice data matching the first semantic information is determined according to it. Because the first semantic information reflects at least two keywords in the first text data corresponding to the first voice data, as well as the association relationship between those keywords, it represents the meaning of the first voice data more accurately; that is, the accuracy of speech recognition is higher, so the matched second voice data is more accurate and the accuracy of the video robot's conversation can be improved. Second video data matching the second voice data is determined from the video database, the second video data and the second voice data are synthesized to obtain synthesized data, the synthesized data is sent to the client, and service processing is performed according to the first multimedia data returned by the client. Processing and replying to the user's voice data in real time can improve the efficiency of service processing. Because the second voice data and the second video data are synthesized before being sent to the client, the user sees, through the client, a video robot holding a video call with them, which makes the human-computer interaction more natural and more engaging and improves the user experience.
Refer to FIG. 5, which is a schematic structural diagram of a computer device provided by an embodiment of this application. As shown in FIG. 5, the computer device 50 may include a processor 501, a network interface 504, and a memory 505; in addition, the computer device 50 may further include a user interface 503 and at least one communication bus 502. The communication bus 502 is used to implement connection and communication between these components. The user interface 503 may include a display and a keyboard; optionally, the user interface 503 may also include a standard wired interface and a wireless interface. The network interface 504 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface). The memory 505 may be a high-speed RAM memory, or a non-volatile memory, for example at least one disk memory. Optionally, the memory 505 may also be at least one storage device located remotely from the processor 501. As shown in FIG. 5, the memory 505, as a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 50 shown in FIG. 5, the network interface 504 may provide network communication functions, the user interface 503 is mainly used to provide an input interface for the user, and the processor 501 may be used to call the device control application program stored in the memory 505 to:
接收客户端所发送的业务办理请求,该业务办理请求包括第一语音数据和第一视频数据;Receiving a service handling request sent by the client, where the service handling request includes the first voice data and the first video data;
从该第一语音数据中获取第一语义信息,根据该第一语义信息确定与该第一语义信息匹配的第二语音数据;该第二语音数据是针对该第一语音数据的回复数据,该第一语义信息用于反映该第一语音数据对应的第一文本数据中的至少两个关键词,以及至少两个关键词之间的关联关系;Acquire first semantic information from the first voice data, and determine second voice data matching the first semantic information according to the first semantic information; the second voice data is reply data to the first voice data, the The first semantic information is used to reflect at least two keywords in the first text data corresponding to the first voice data, and the association relationship between the at least two keywords;
从视频数据库中确定与该第二语音数据匹配的第二视频数据,该视频数据库中包括多个视频数据;Determining second video data matching the second voice data from a video database, where the video database includes multiple video data;
对该第二视频数据与该第二语音数据进行合成处理,得到合成数据,将该合成数据发送至该客户端,根据该客户端回复的第一多媒体数据进行业务处理,该第一多媒体数据是该客户端针对该合成数据的回复数据。Perform synthesis processing on the second video data and the second voice data to obtain synthesized data, send the synthesized data to the client, and perform service processing according to the first multimedia data replies from the client. The media data is the reply data of the client for the composite data.
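The steps performed by the processor can be read as one request-handling pipeline. The Python sketch below is illustrative only and is not taken from the application: `ServiceRequest`, `SynthesizedReply`, `handle_service_request`, and the three injected callables are hypothetical names, and the speech, matching, and muxing components are passed in as functions rather than implemented.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ServiceRequest:
    voice: bytes  # first voice data
    video: bytes  # first video data


@dataclass
class SynthesizedReply:
    voice: bytes  # second voice data (reply to the first voice data)
    video: bytes  # second video data selected from the video database


def handle_service_request(
    request: ServiceRequest,
    extract_semantics: Callable[[bytes], dict],
    match_reply_voice: Callable[[dict], bytes],
    match_video: Callable[[bytes], bytes],
) -> SynthesizedReply:
    """Run the steps in order; every stage is injected, not implemented."""
    semantics = extract_semantics(request.voice)  # keywords + association
    reply_voice = match_reply_voice(semantics)    # second voice data
    reply_video = match_video(reply_voice)        # second video data
    # Pairing the two streams stands in for the synthesis step.
    return SynthesizedReply(voice=reply_voice, video=reply_video)
```

Sending the synthesized data to the client and processing the first multimedia data it returns are left out of the sketch, since the application does not fix a transport for them.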
It should be understood that the computer device 50 described in the embodiments of the present application can carry out the data processing method described in the embodiments corresponding to FIG. 2 and FIG. 3, and can also carry out the functions of the data processing apparatus described in the embodiment corresponding to FIG. 4; details are not repeated here. Likewise, the beneficial effects of using the same method are not described again.
In the embodiments of the present application, a service handling request sent by a client is received, the request including first voice data and first video data; first semantic information is obtained from the first voice data, and second voice data matching the first semantic information is determined according to it. Because the first semantic information reflects at least two keywords in the first text data corresponding to the first voice data, as well as the association relationship between those keywords, it represents the meaning of the first voice data more accurately. In other words, speech recognition is more accurate, so the matched second voice data is more accurate, which improves the accuracy of the video robot's calls. Second video data matching the second voice data is determined from the video database, the second video data and the second voice data are synthesized to obtain synthesized data, the synthesized data is sent to the client, and service processing is performed according to the first multimedia data returned by the client. Processing and replying to the user's voice data in real time improves the efficiency of service processing. Because the second voice data and the second video data are synthesized before being sent to the client, the user sees, through the client, a video robot holding a video call with them, which makes human-computer interaction more natural and more engaging and improves the user experience.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program. The computer program includes program instructions that, when executed by a computer, cause the computer to perform the method of the foregoing embodiments. The computer may be part of the computer device mentioned above, for example the processor 501. As an example, the program instructions may be deployed for execution on one computer device, on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network; multiple computer devices distributed across multiple sites and interconnected by a communication network may form a blockchain network.
Optionally, a medium involved in this application, such as the computer-readable storage medium, may be non-volatile or volatile.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
What is disclosed above is merely the preferred embodiments of this application and does not limit the scope of rights of this application. Equivalent changes made in accordance with the claims of this application therefore still fall within the scope covered by this application.

Claims (20)

  1. A data processing method, comprising:
    receiving a service handling request sent by a client, the service handling request including first voice data and first video data;
    obtaining first semantic information from the first voice data, and determining, according to the first semantic information, second voice data matching the first semantic information, wherein the second voice data is reply data for the first voice data, and the first semantic information reflects at least two keywords in first text data corresponding to the first voice data and an association relationship between the at least two keywords;
    determining, from a video database, second video data matching the second voice data, the video database including a plurality of video data;
    synthesizing the second video data and the second voice data to obtain synthesized data, sending the synthesized data to the client, and performing service processing according to first multimedia data returned by the client, the first multimedia data being the client's reply data to the synthesized data.
  2. The method according to claim 1, wherein obtaining the first semantic information from the first voice data comprises:
    converting the first voice data to obtain the first text data corresponding to the first voice data;
    performing keyword extraction on the first text data to obtain the at least two keywords;
    obtaining word meaning information of the at least two keywords;
    obtaining a word combination from the at least two keywords, and obtaining combined word meaning information of the word combination;
    determining the association relationship between the at least two keywords according to the word meaning information and the combined word meaning information;
    determining semantic information corresponding to the first text data according to the at least two keywords and the association relationship between the at least two keywords, as the first semantic information.
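By way of non-limiting illustration, the steps of claim 2 after speech-to-text conversion might be sketched as follows. Everything here is an assumption made for the example: the lexicons `WORD_MEANINGS` and `COMBINED_MEANINGS`, the whitespace tokenizer, and the rule that the association is read directly from the combined meaning of the first two keywords. The application does not specify how keywords are extracted or how meanings are stored.

```python
from typing import Optional

# Hypothetical lexicon: meaning of each individual keyword.
WORD_MEANINGS = {
    "apply": "request a new product",
    "repay": "pay back money owed",
    "loan": "borrowed money",
}

# Hypothetical lexicon: meaning of a keyword combination.
COMBINED_MEANINGS = {
    ("apply", "loan"): "open a loan application",
    ("repay", "loan"): "make a loan payment",
}


def extract_semantic_info(first_text: str) -> Optional[dict]:
    """Return keywords, their meanings, and their association, or None."""
    tokens = first_text.lower().split()
    keywords = [t for t in tokens if t in WORD_MEANINGS]
    if len(keywords) < 2:
        return None  # the method needs at least two keywords
    pair = (keywords[0], keywords[1])
    word_meanings = {k: WORD_MEANINGS[k] for k in pair}
    # Assumed rule: the combined meaning, when known, is the association.
    association = COMBINED_MEANINGS.get(pair, "unrelated")
    return {
        "keywords": pair,
        "word_meanings": word_meanings,
        "association": association,
    }
```

For the input "I want to apply for a loan", the sketch keeps the keywords `apply` and `loan` and records their association as "open a loan application".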
  3. The method according to claim 2, wherein determining, according to the first semantic information, the second voice data matching the first semantic information comprises:
    obtaining similarities between the at least two keywords and a plurality of text data in a corpus to obtain a plurality of first similarities;
    determining text data corresponding to the association relationship between the at least two keywords, and obtaining similarities between the text data corresponding to the association relationship and the plurality of text data to obtain a plurality of second similarities;
    determining the text data corresponding to the largest of the plurality of first similarities as first target text data, and determining the text data corresponding to the largest of the plurality of second similarities as second target text data;
    determining the second text data according to the first target text data and the second target text data;
    converting the second text data to obtain voice data corresponding to the second text data, as the second voice data.
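By way of non-limiting illustration, the two similarity passes of claim 3 might be sketched as follows. The claim names neither a similarity measure nor a rule for combining the two target texts, so the sketch assumes Jaccard similarity over lower-cased tokens and assumes that, when the two targets differ, the one with the higher score is used. The function names are hypothetical, and text-to-speech conversion of the result is omitted.

```python
from typing import List, Sequence, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(query: Set[str], corpus: Sequence[str]) -> Tuple[float, str]:
    """Return (score, text) for the corpus entry most similar to query."""
    scored: List[Tuple[float, str]] = [
        (jaccard(query, set(text.lower().split())), text) for text in corpus
    ]
    return max(scored, key=lambda pair: pair[0])


def select_reply_text(
    keywords: Sequence[str], association_text: str, corpus: Sequence[str]
) -> str:
    # First similarities: keywords against every corpus text.
    score1, target1 = best_match(set(keywords), corpus)
    # Second similarities: association text against every corpus text.
    score2, target2 = best_match(set(association_text.lower().split()), corpus)
    if target1 == target2:
        return target1
    # Assumed tie-break: keep the target with the higher similarity.
    return target1 if score1 >= score2 else target2
```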
  4. The method according to claim 3, wherein determining, from the video database, the second video data matching the second voice data comprises:
    obtaining a semantic scene of the second text data and an application scene of each video data in the video database;
    determining, from the application scenes, a target application scene matching the semantic scene;
    determining the video data corresponding to the target application scene as the second video data.
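By way of non-limiting illustration, the scene matching of claim 4 might be sketched as a lookup. The cue-word table `SCENE_CUES`, the scene labels, the file names, and the in-memory `VIDEO_DATABASE` are all hypothetical; the application does not say how a semantic scene is derived from the reply text or how application scenes are stored.

```python
from typing import List, Optional

# Hypothetical cue words mapping reply text to a scene label.
SCENE_CUES = {
    "welcome": "greeting",
    "amount": "loan_inquiry",
    "goodbye": "farewell",
}

# Hypothetical video database: each entry carries its application scene.
VIDEO_DATABASE = [
    {"file": "greeting.mp4", "scene": "greeting"},
    {"file": "loan_inquiry.mp4", "scene": "loan_inquiry"},
]


def infer_semantic_scene(second_text: str) -> Optional[str]:
    """Return the scene of the first cue word found, or None."""
    for token in second_text.lower().split():
        if token in SCENE_CUES:
            return SCENE_CUES[token]
    return None


def select_video(second_text: str, video_database: List[dict]) -> Optional[dict]:
    """Pick the video whose application scene equals the semantic scene."""
    scene = infer_semantic_scene(second_text)
    for video in video_database:
        if video["scene"] == scene:
            return video
    return None  # no application scene matches
```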
  5. The method according to claim 1, wherein synthesizing the second video data and the second voice data to obtain the synthesized data comprises:
    obtaining a voice duration of the second voice data;
    extracting, from the second video data, video data whose duration equals the voice duration, as candidate video data;
    synthesizing the second voice data and the candidate video data to obtain the synthesized data.
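By way of non-limiting illustration, the duration matching of claim 5 might be sketched as follows. The sketch assumes the video is a list of frames at a fixed frame rate and the voice is a list of samples at a fixed sample rate; these representations and the function names are hypothetical. The claim does not cover a video shorter than the voice, so the sketch raises an error in that case. Actual muxing into a container format is omitted.

```python
from typing import List, Sequence


def trim_video_to_voice(
    frames: Sequence, fps: float, voice_seconds: float
) -> List:
    """Keep the leading frames whose total duration equals the voice duration."""
    needed = round(voice_seconds * fps)
    if needed > len(frames):
        # Not addressed by the claim; the sketch refuses rather than guessing.
        raise ValueError("video is shorter than the voice data")
    return list(frames[:needed])


def synthesize(
    frames: Sequence, fps: float, voice_samples: Sequence, sample_rate: int
) -> dict:
    """Pair the voice with a candidate clip of equal duration."""
    voice_seconds = len(voice_samples) / sample_rate
    candidate = trim_video_to_voice(frames, fps, voice_seconds)
    return {
        "video": candidate,
        "audio": list(voice_samples),
        "seconds": voice_seconds,
    }
```

For example, two seconds of audio at 16,000 samples per second selects the first 50 frames of a 25 fps video.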
  6. The method according to claim 1, further comprising:
    capturing, from the first video data, a first image of a user corresponding to the client;
    verifying the legitimacy of the client according to the first image;
    if the client is legitimate, performing the step of performing service processing according to the first multimedia data returned by the client.
  7. The method according to claim 6, further comprising:
    if the client is not legitimate, sending, to the client, adjustment information instructing the user to adjust posture;
    obtaining second multimedia data sent by the client in response to the adjustment information, the second multimedia data including third video data;
    capturing a second image of the user from the third video data;
    verifying the legitimacy of the client according to the second image.
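By way of non-limiting illustration, the verification flow of claims 6 and 7 might be sketched as a check with a single retry. The image capture, the legitimacy check (for example a face comparison), and the client round trip are injected as callables; `verify_client`, its parameters, and the prompt text are hypothetical and do not come from the application.

```python
from typing import Any, Callable


def verify_client(
    first_video: Any,
    capture_image: Callable[[Any], Any],
    is_legitimate: Callable[[Any], bool],
    request_adjusted_video: Callable[[str], Any],
) -> bool:
    """Verify on the first image; on failure, ask for adjustment and retry once."""
    first_image = capture_image(first_video)
    if is_legitimate(first_image):
        return True  # service processing may proceed
    # Send adjustment information and receive the third video data in reply.
    third_video = request_adjusted_video(
        "Please adjust your posture and face the camera."
    )
    second_image = capture_image(third_video)
    return is_legitimate(second_image)
```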
  8. A data processing apparatus, comprising:
    a data acquisition module, configured to receive a service handling request sent by a client, the service handling request including first voice data and first video data;
    a voice matching module, configured to obtain first semantic information from the first voice data, and determine, according to the first semantic information, second voice data matching the first semantic information, wherein the second voice data is reply data for the first voice data, and the first semantic information reflects at least two keywords in first text data corresponding to the first voice data and an association relationship between the at least two keywords;
    a video matching module, configured to determine, from a video database, second video data matching the second voice data, the video database including a plurality of video data;
    a service processing module, configured to synthesize the second video data and the second voice data to obtain synthesized data, send the synthesized data to the client, and perform service processing according to second multimedia data returned by the client, the second multimedia data being the client's reply data to the synthesized data.
  9. A computer device, comprising a processor, a memory, and a network interface;
    the processor being connected to the memory and the network interface, wherein the network interface is configured to provide a data communication function, the memory is configured to store program code, and the processor is configured to call the program code to perform the following method:
    receiving a service handling request sent by a client, the service handling request including first voice data and first video data;
    obtaining first semantic information from the first voice data, and determining, according to the first semantic information, second voice data matching the first semantic information, wherein the second voice data is reply data for the first voice data, and the first semantic information reflects at least two keywords in first text data corresponding to the first voice data and an association relationship between the at least two keywords;
    determining, from a video database, second video data matching the second voice data, the video database including a plurality of video data;
    synthesizing the second video data and the second voice data to obtain synthesized data, sending the synthesized data to the client, and performing service processing according to first multimedia data returned by the client, the first multimedia data being the client's reply data to the synthesized data.
  10. The computer device according to claim 9, wherein, when obtaining the first semantic information from the first voice data, the following is specifically performed:
    converting the first voice data to obtain the first text data corresponding to the first voice data;
    performing keyword extraction on the first text data to obtain the at least two keywords;
    obtaining word meaning information of the at least two keywords;
    obtaining a word combination from the at least two keywords, and obtaining combined word meaning information of the word combination;
    determining the association relationship between the at least two keywords according to the word meaning information and the combined word meaning information;
    determining semantic information corresponding to the first text data according to the at least two keywords and the association relationship between the at least two keywords, as the first semantic information.
  11. The computer device according to claim 10, wherein, when determining, according to the first semantic information, the second voice data matching the first semantic information, the following is specifically performed:
    obtaining similarities between the at least two keywords and a plurality of text data in a corpus to obtain a plurality of first similarities;
    determining text data corresponding to the association relationship between the at least two keywords, and obtaining similarities between the text data corresponding to the association relationship and the plurality of text data to obtain a plurality of second similarities;
    determining the text data corresponding to the largest of the plurality of first similarities as first target text data, and determining the text data corresponding to the largest of the plurality of second similarities as second target text data;
    determining the second text data according to the first target text data and the second target text data;
    converting the second text data to obtain voice data corresponding to the second text data, as the second voice data.
  12. The computer device according to claim 11, wherein, when determining, from the video database, the second video data matching the second voice data, the following is specifically performed:
    obtaining a semantic scene of the second text data and an application scene of each video data in the video database;
    determining, from the application scenes, a target application scene matching the semantic scene;
    determining the video data corresponding to the target application scene as the second video data.
  13. The computer device according to claim 9, wherein the processor is further configured to perform:
    capturing, from the first video data, a first image of a user corresponding to the client;
    verifying the legitimacy of the client according to the first image;
    if the client is legitimate, performing the step of performing service processing according to the first multimedia data returned by the client.
  14. The computer device according to claim 13, wherein the processor is further configured to perform:
    if the client is not legitimate, sending, to the client, adjustment information instructing the user to adjust posture;
    obtaining second multimedia data sent by the client in response to the adjustment information, the second multimedia data including third video data;
    capturing a second image of the user from the third video data;
    verifying the legitimacy of the client according to the second image.
  15. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to perform the following method:
    receiving a service handling request sent by a client, the service handling request including first voice data and first video data;
    obtaining first semantic information from the first voice data, and determining, according to the first semantic information, second voice data matching the first semantic information, wherein the second voice data is reply data for the first voice data, and the first semantic information reflects at least two keywords in first text data corresponding to the first voice data and an association relationship between the at least two keywords;
    determining, from a video database, second video data matching the second voice data, the video database including a plurality of video data;
    synthesizing the second video data and the second voice data to obtain synthesized data, sending the synthesized data to the client, and performing service processing according to first multimedia data returned by the client, the first multimedia data being the client's reply data to the synthesized data.
  16. The computer-readable storage medium according to claim 15, wherein, when obtaining the first semantic information from the first voice data, the following is specifically performed:
    converting the first voice data to obtain the first text data corresponding to the first voice data;
    performing keyword extraction on the first text data to obtain the at least two keywords;
    obtaining word meaning information of the at least two keywords;
    obtaining a word combination from the at least two keywords, and obtaining combined word meaning information of the word combination;
    determining the association relationship between the at least two keywords according to the word meaning information and the combined word meaning information;
    determining semantic information corresponding to the first text data according to the at least two keywords and the association relationship between the at least two keywords, as the first semantic information.
  17. The computer-readable storage medium according to claim 16, wherein, when determining, according to the first semantic information, the second voice data matching the first semantic information, the following is specifically performed:
    obtaining similarities between the at least two keywords and a plurality of text data in a corpus to obtain a plurality of first similarities;
    determining text data corresponding to the association relationship between the at least two keywords, and obtaining similarities between the text data corresponding to the association relationship and the plurality of text data to obtain a plurality of second similarities;
    determining the text data corresponding to the largest of the plurality of first similarities as first target text data, and determining the text data corresponding to the largest of the plurality of second similarities as second target text data;
    determining the second text data according to the first target text data and the second target text data;
    converting the second text data to obtain voice data corresponding to the second text data, as the second voice data.
  18. The computer-readable storage medium according to claim 17, wherein, when determining, from the video database, the second video data matching the second voice data, the following is specifically performed:
    obtaining a semantic scene of the second text data and an application scene of each video data in the video database;
    determining, from the application scenes, a target application scene matching the semantic scene;
    determining the video data corresponding to the target application scene as the second video data.
  19. The computer-readable storage medium according to claim 15, wherein the program instructions, when executed by the processor, further cause the processor to perform:
    capturing, from the first video data, a first image of a user corresponding to the client;
    verifying the legitimacy of the client according to the first image;
    if the client is legitimate, performing the step of performing service processing according to the first multimedia data returned by the client.
  20. The computer-readable storage medium according to claim 19, wherein the program instructions, when executed by the processor, further cause the processor to perform:
    if the client is not legitimate, sending, to the client, adjustment information instructing the user to adjust posture;
    obtaining second multimedia data sent by the client in response to the adjustment information, the second multimedia data including third video data;
    capturing a second image of the user from the third video data;
    verifying the legitimacy of the client according to the second image.
PCT/CN2020/124256 2020-09-22 2020-10-28 Data processing method and apparatus, device, and medium WO2021159734A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011006333.7 2020-09-22
CN202011006333.7A CN112131365A (en) 2020-09-22 2020-09-22 Data processing method, device, equipment and medium

Publications (1)

Publication Number Publication Date
WO2021159734A1 (en) 2021-08-19

Family

ID=73842593

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/124256 WO2021159734A1 (en) 2020-09-22 2020-10-28 Data processing method and apparatus, device, and medium

Country Status (2)

Country Link
CN (1) CN112131365A (en)
WO (1) WO2021159734A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114760425A (en) * 2022-03-21 2022-07-15 京东科技信息技术有限公司 Digital human generation method, device, computer equipment and storage medium
CN115022395B (en) * 2022-05-27 2023-08-08 艾普科创(北京)控股有限公司 Service video pushing method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008216461A (en) * 2007-03-01 2008-09-18 Nec Corp Speech recognition, keyword extraction, and knowledge base retrieval coordinating device
CN108090170A (en) * 2017-12-14 2018-05-29 南京美桥信息科技有限公司 A kind of intelligence inquiry method for recognizing semantics and visible intelligent interrogation system
CN109241332A (en) * 2018-10-19 2019-01-18 广东小天才科技有限公司 It is a kind of to determine semantic method and system by voice
CN110405791A (en) * 2019-08-16 2019-11-05 江苏遨信科技有限公司 A kind of robot imitates and the method and system of study speech
KR20200081925A (en) * 2018-12-28 2020-07-08 수상에스티(주) System for voice recognition of interactive robot and the method therof

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107918913B (en) * 2017-11-20 2022-01-21 中国银行股份有限公司 Bank business processing method, device and system
CN110162780B (en) * 2019-04-08 2023-05-09 深圳市金微蓝技术有限公司 User intention recognition method and device
CN110489527A (en) * 2019-08-13 2019-11-22 南京邮电大学 Banking intelligent consulting based on interactive voice and handle method and system

Also Published As

Publication number Publication date
CN112131365A (en) 2020-12-25

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20919008

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20919008

Country of ref document: EP

Kind code of ref document: A1