CN114049902B - Alibaba Cloud-based recording upload recognition and emotion analysis method and system - Google Patents


Info

Publication number
CN114049902B
CN114049902B (application number CN202111252398.4A; published as CN114049902A)
Authority
CN
China
Prior art keywords
recording
file
uploading
emotion analysis
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111252398.4A
Other languages
Chinese (zh)
Other versions
CN114049902A (en)
Inventor
吕文哲 (Lyu Wenzhe)
陈炳标 (Chen Bingbiao)
柯志忠 (Ke Zhizhong)
许东武 (Xu Dongwu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Infinite Information Technology Co., Ltd.
Original Assignee
Guangdong Infinite Information Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Infinite Information Technology Co., Ltd.
Priority to CN202111252398.4A
Publication of CN114049902A
Application granted
Publication of CN114049902B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/26 Visual data mining; Browsing structured data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/61 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/65 Clustering; Classification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/06 Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an Alibaba Cloud (Aliyun)-based recording upload recognition and emotion analysis method and system, comprising the following steps: a recording file to be recognized is uploaded to a server; the server uploads the audio file to Alibaba Cloud for speech recognition and speech-to-text conversion; the JSON strings returned by Alibaba Cloud are recombined, parsed and assembled into sentences; the sentences undergo time-point segmentation and telephone-number recognition and are output to a web page; the sentences are uploaded to Alibaba Cloud for emotion analysis; the emotion-analysis JSON string returned by Alibaba Cloud is parsed; the parsed data are matched against the segmented sentences; and an emotion analysis heat map is generated. Based on Alibaba Cloud intelligent speech interaction and natural language processing, the invention makes full use of the recognition results, and an asynchronous approach avoids blocking at the program level. The method offers high availability, high accuracy and high efficiency, and is widely applicable to the field of recording analysis.

Description

Alibaba Cloud-based recording upload recognition and emotion analysis method and system
Technical Field
The invention relates to the technical field of Internet cloud computing, in particular to a recording upload recognition and emotion analysis method and system based on Alibaba Cloud (Aliyun).
Background
With scientific progress, speech recognition technology has developed rapidly over the last decade. Experts predict that within ten to twenty years it will enter fields as varied as industry, household appliances, communications, automotive electronics, medical care, home services and consumer electronics. Speech recognition is the technology by which a machine recognizes and understands a speech signal, converting it into corresponding text or commands. The speech recognition in today's mainstream chat software, such as QQ and WeChat, is limited to real-time chat with short utterances, while the recording-analysis technologies on the market aimed at long recordings suffer from slow recognition, poor accuracy, incomplete recognition results and little practical information.
Disclosure of Invention
In view of this, an embodiment of the invention provides a recording upload recognition and emotion analysis method and system based on Alibaba Cloud.
The invention provides an Alibaba Cloud-based recording upload recognition and emotion analysis method, comprising the following steps:
uploading a recording file to be recognized to a server;
the server distinguishing and classifying the recording file to be recognized;
uploading the recording file to be recognized to Alibaba Cloud, where it is recognized and converted to text;
the server obtaining the first JSON strings returned by Alibaba Cloud, each first JSON string corresponding to a word or short phrase;
recombining and parsing the returned first JSON strings and assembling them into sentences;
the server performing time-point segmentation and telephone-number recognition on the assembled sentences;
uploading the sentences to the Alibaba Cloud server for emotion analysis;
the server obtaining the second JSON string returned by Alibaba Cloud and parsing it;
the server matching the parsed data, sentence by sentence, with the sentences segmented by time point;
and the server generating an emotion analysis heat map from the parsed data.
Further, the server distinguishing and classifying the recording file to be recognized includes: distinguishing and classifying the recording file according to its format, size and duration.
Further, the uploading of the recording file to be recognized to a server includes:
uploading the recording file to the server as a binary stream over the HTTP protocol.
Further, the recombining and parsing of the returned first JSON strings and assembling them into a sentence specifically includes:
obtaining the taskId in each first JSON string returned by Alibaba Cloud, where the taskId distinguishes one audio file from another;
obtaining the first JSON strings corresponding to each taskId, thereby obtaining the word or short phrase corresponding to each first JSON string;
polling the first JSON strings and uploading them to Alibaba Cloud;
obtaining the third JSON string, returned by Alibaba Cloud after polling, that corresponds to each word or short phrase;
extracting the words or phrases from the third JSON strings using FastJSON;
and extracting the key information in the third JSON strings, splicing it into a sentence, and storing the sentence in a database.
Further, extracting the key information in the third JSON strings, splicing it into a sentence, and storing the sentence in a database specifically includes the following steps:
extracting the key information in each third JSON string with a regular expression keyed on the Chinese characters and punctuation marks appearing in the returned third JSON strings;
parsing the time information in the returned third JSON strings with a regular expression;
and ordering the extracted segments by their time information and splicing them into a sentence.
Further, the server performing time-point segmentation and telephone-number recognition on the assembled sentence specifically includes:
extracting the start time and end time of the corresponding word or phrase from the first JSON string;
subtracting the start time from the end time to obtain the duration of the word or phrase;
matching the duration to the word or phrase and storing both in a database;
parsing the digits in the text recognition result with a regular expression and marking a digit run as a telephone number when it is 11 digits long;
storing all telephone numbers in the database and counting each number's occurrences with a database view;
and outputting the telephone numbers and their counts to the web page, matched to the corresponding timestamps.
Further, the server generating an emotion analysis heat map from the parsed data specifically includes the following steps:
obtaining the Alibaba Cloud emotion analysis result, which includes an emotion fluctuation value;
aggregating the emotion analysis results cumulatively;
establishing a rectangular coordinate system with the recording's time nodes on the horizontal axis and the emotion fluctuation value on the vertical axis;
matching each cumulative result to its corresponding time node and plotting a line graph in the coordinate system;
and generating the emotion analysis heat map and outputting it to the web page.
Further, before the recording file to be recognized is uploaded to Alibaba Cloud, the method further comprises the following steps:
extracting the file type, file size and file duration of the recording file to be recognized and storing them in a database;
and matching against the recording files in the database by file type, file size and file duration; if an identical recording file is matched in the database, directly outputting that file's stored analysis result and emotion analysis heat map to the web page.
Further, the Alibaba Cloud-based recording upload recognition and emotion analysis method further comprises:
analyzing the recording file to be recognized, and performing the emotion analysis, with multiple threads, allocating a thread to each step and executing the method asynchronously;
and adding a synchronization lock between the steps so that a subsequent step does not execute until the preceding step has completed.
The invention also discloses a recording recognition and emotion analysis system, characterized by comprising a user login and registration module, a file upload module, a recording recognition module, a recording analysis module and an emotion analysis module;
the user login and registration module is used for managing user permissions;
and the file upload module, recording recognition module, recording analysis module and emotion analysis module cooperate to execute the method described above.
The invention has the following beneficial effects. The invention parses the recording file and segments the sentences by time; in applicable scenarios such as conference recordings and telephone customer-service recordings, this allows quick positioning of a time node within the recording and browsing of the related analysis information, enabling the screening of key issues in the recording and its effective monitoring. The invention extracts the telephone numbers mentioned in the recording in a targeted way, making it easy for technicians to record and count them quickly. Beyond recognizing and analyzing the recording file, the invention uses the analysis results to generate an emotion analysis heat map, helping technicians judge the emotional trend of the recorded speaker and grasp, quickly and intuitively, the emotional fluctuations during a conversation. Overall, for a recording file whose analysis results already exist in the database, the invention outputs those stored results directly, forming a cache-like mechanism that improves execution efficiency; and by allocating multiple threads and adding synchronization locks, the invention further improves availability.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a data flow diagram of the present invention;
FIG. 2 is a block diagram of the present invention;
FIG. 3 is an E-R diagram of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
This embodiment addresses the problems of slow recording recognition, poor accuracy, incomplete analysis of recognition results and scant practical information for long recordings in scenarios such as company meetings and telephone customer service. The overall framework is shown in Fig. 1, and the implementation of this embodiment is as follows:
1. The recording files to be recognized are uploaded to a server; single-track and dual-track recording files in the .wav, .mp3, .m4a, .wma, .aac, .ogg, .amr and .flac formats are supported.
2. The server distinguishes and classifies each recording file to be recognized by its format, size and duration, and stores the file's detailed information in a MySQL database. A sketch of this classification step follows.
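As a hedged illustration of step 2, the following Java sketch classifies a recording by format and size; the size thresholds are invented for illustration, and the duration is assumed to be obtained elsewhere, e.g. from an audio decoder:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Set;

    public class RecordingClassifier {
        // The formats listed in the embodiment
        private static final Set<String> SUPPORTED =
                Set.of("wav", "mp3", "m4a", "wma", "aac", "ogg", "amr", "flac");

        public static String classify(Path file) throws Exception {
            String name = file.getFileName().toString().toLowerCase();
            String ext = name.substring(name.lastIndexOf('.') + 1);
            if (!SUPPORTED.contains(ext)) {
                return "rejected: unsupported format " + ext;
            }
            long bytes = Files.size(file);
            // Illustrative size buckets; the thresholds are not taken from the patent
            String sizeClass = bytes < 5_000_000 ? "small"
                             : bytes < 50_000_000 ? "medium" : "large";
            return ext + "/" + sizeClass;
        }

        public static void main(String[] args) throws Exception {
            System.out.println(classify(Path.of("meeting.wav"))); // e.g. "wav/small"
        }
    }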
3. The recording file to be recognized is uploaded to Alibaba Cloud, where it is recognized and converted to text.
4. The server recombines and parses the JSON strings returned by Alibaba Cloud and assembles them into sentences.
5. The server performs time-point segmentation and telephone-number recognition on the assembled sentences, outputting the sentences to the web page and storing them in the MySQL database.
6. The sentences are uploaded to the Alibaba Cloud server for emotion analysis.
7. The server obtains the JSON strings returned by Alibaba Cloud, parses them into an ordered, coherent sequence of values, and stores the parsed result in the MySQL database.
8. The server matches the parsed data, sentence by sentence, with the sentences segmented by time point.
9. The server generates an emotion analysis heat map from the parsed data.
The recombination and parsing of step 4 comprises the following specific steps (a code sketch follows this list):
1) Obtain the taskId in each JSON string returned by Alibaba Cloud; the taskId distinguishes one audio file from another.
2) Obtain the corresponding JSON strings by taskId, each JSON string corresponding to one short phrase, and poll for results.
3) After Alibaba Cloud finishes processing and returns results to the server, obtain the polled JSON string for each word or short phrase and extract the phrases with FastJSON.
4) Extract the key information in each JSON string by keyword, splice it into sentences, and store it in the database. The keywords here are the Chinese characters and punctuation marks appearing in the returned JSON strings, and regular expressions are used to extract this information.
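A minimal Java sketch of this recombination, assuming illustrative field names ("begin_time", "text") that may not match the actual Aliyun payload; it uses FastJSON as the embodiment does, keeps only Chinese characters and punctuation via a regular expression, and joins the fragments in time order:

    import com.alibaba.fastjson.JSON;
    import com.alibaba.fastjson.JSONObject;
    import java.util.List;
    import java.util.TreeMap;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SentenceAssembler {
        // Keep only Chinese characters and common punctuation, mirroring the
        // embodiment's keyword-based regular expression
        private static final Pattern KEY_INFO =
                Pattern.compile("[\\u4e00-\\u9fa5，。？！、；：]+");

        public static String assemble(List<String> jsonStrings) {
            TreeMap<Long, String> ordered = new TreeMap<>(); // begin time (ms) -> text
            for (String s : jsonStrings) {
                JSONObject obj = JSON.parseObject(s);
                long begin = obj.getLongValue("begin_time");      // assumed field name
                Matcher m = KEY_INFO.matcher(obj.getString("text")); // assumed field name
                StringBuilder sb = new StringBuilder();
                while (m.find()) sb.append(m.group());
                ordered.put(begin, sb.toString());
            }
            return String.join("", ordered.values()); // time-ordered sentence
        }
    }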
Step 5 performs time-point segmentation and telephone-number recognition on the sentences as follows (a code sketch follows this list):
1) Extract the corresponding start and end times (in milliseconds) from the word and phrase JSON strings obtained in step 4.
2) Subtract the start time from the end time to obtain the phrase's duration.
3) Match the duration to the corresponding phrase in the database and store both.
4) Scan the digits in the recognition result with a regular expression, judging a digit run to be a telephone number when it is exactly eleven digits long.
5) Finally, output these telephone numbers and their occurrence counts, matched to the corresponding timestamps and adapted to the web page.
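A hedged Java sketch of the duration computation and the eleven-digit telephone-number scan described above; the sample text in main is invented for illustration:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SegmentAnalyzer {
        private static final Pattern DIGITS = Pattern.compile("\\d+");

        public static long durationMs(long beginMs, long endMs) {
            return endMs - beginMs; // phrase duration in milliseconds
        }

        public static Map<String, Integer> findPhoneNumbers(String text) {
            Map<String, Integer> counts = new HashMap<>();
            Matcher m = DIGITS.matcher(text);
            while (m.find()) {
                String run = m.group();
                if (run.length() == 11) {               // 11 digits: treat as phone number
                    counts.merge(run, 1, Integer::sum); // count occurrences
                }
            }
            return counts;
        }

        public static void main(String[] args) {
            System.out.println(durationMs(1200, 4700)); // 3500 ms
            System.out.println(findPhoneNumbers("请回拨13812345678，再次确认13812345678。"));
        }
    }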
The emotion analysis heat map of step 9 is generated as follows (a code sketch follows this list):
1) Obtain the Alibaba Cloud emotion analysis results, retrieving them by polling.
2) Aggregate the result data cumulatively.
3) Match each cumulative result to the corresponding time point of the recording file (the abscissa is time, the ordinate the emotion fluctuation value) and generate a line graph.
4) Generate the emotion analysis heat map and output it to the web page. This output is matched by time and data: the emotion data for a sentence is processed only after the corresponding text has been processed into the database, and emotion and text are matched in time order, where -1 denotes negative emotion, 1 positive emotion, and 0 no emotion.
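A minimal Java sketch of the cumulative aggregation behind the line graph, assuming per-sentence scores of -1, 0 or 1 paired with millisecond time nodes as described above; the resulting (time, cumulative value) series is what gets plotted:

    import java.util.ArrayList;
    import java.util.List;

    public class EmotionCurve {
        public static List<long[]> accumulate(List<long[]> timeAndScore) {
            List<long[]> curve = new ArrayList<>();
            long running = 0;
            for (long[] point : timeAndScore) {
                running += point[1];                       // add -1 / 0 / 1
                curve.add(new long[]{point[0], running});  // (time ms, cumulative emotion)
            }
            return curve;
        }

        public static void main(String[] args) {
            List<long[]> input = List.of(
                    new long[]{0, 1}, new long[]{3500, -1}, new long[]{8000, -1});
            for (long[] p : accumulate(input))
                System.out.println(p[0] + " ms -> " + p[1]); // 1, 0, -1
        }
    }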
In addition, this embodiment builds a cache-like mechanism on top of the database. Past recording analysis data is kept in the database, and each time a user converts a recording to text, the database is searched for a match on information such as the recording file's name, duration and size. If the same file already exists in the database, the previously parsed data is returned directly; otherwise the file is uploaded to Alibaba Cloud for speech-to-text conversion. A sketch of this lookup follows.
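A hedged JDBC sketch of the cache-like lookup; the table and column names ("recording", "file_name", "file_size", "duration_ms", "analysis_result") are assumptions for illustration, not taken from the patent:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.Optional;

    public class AnalysisCache {
        public static Optional<String> lookup(Connection conn, String name,
                                              long size, long durationMs) throws SQLException {
            String sql = "SELECT analysis_result FROM recording " +   // hypothetical table
                         "WHERE file_name = ? AND file_size = ? AND duration_ms = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, name);
                ps.setLong(2, size);
                ps.setLong(3, durationMs);
                try (ResultSet rs = ps.executeQuery()) {
                    // Hit: reuse the stored result; miss: the caller uploads to Aliyun
                    return rs.next() ? Optional.of(rs.getString(1)) : Optional.empty();
                }
            }
        }
    }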
The recording upload recognition and emotion analysis method of this embodiment analyzes audio and emotion information with multiple threads. Threads are allocated to the speech-to-text and emotion-analysis polling stages, and a synchronization lock is added to the method. While the service runs, each audio file is polled first; the emotion analysis function waits for a thread in the speech-to-text function to finish before analyzing that audio's text. Likewise, the emotion of a text can be analyzed only once the complete recognition result for that audio is in the database, and time-point matching is checked during analysis. When the text recognition and emotion analysis results have been recombined and parsed in the program and all of the audio's data has been stored in the database, the emotion analysis results are processed by the corresponding algorithm and output to the web page. Performing recording recognition and emotion analysis with multiple threads avoids the server becoming unresponsive through program blocking when a speech-to-text operation is cancelled midway. A sketch of the asynchronous pipeline follows.
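A minimal sketch of the asynchronous, ordered pipeline using CompletableFuture: the chaining plays the role of the synchronization lock, since the sentiment stage for an audio file cannot start before its transcription stage completes. The stage bodies are placeholders, not the patent's actual processing:

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class AsyncPipeline {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(4);

            CompletableFuture<String> result = CompletableFuture
                    .supplyAsync(() -> "transcribed text of audio 1", pool)    // speech-to-text stage
                    .thenApplyAsync(text -> text + " -> sentiment: +1", pool); // runs only afterwards

            System.out.println(result.get()); // blocks only the caller, not the pool
            pool.shutdown();
        }
    }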
The project structure of this embodiment is shown in Fig. 2. The functional structure consists of five modules: a user login and registration module, a file upload module, a recording recognition module, a recording analysis module and an emotion analysis module. The user login and registration module manages user permissions and is independent of the main functions such as recording recognition. The file upload module handles uploading files locally and to the server. The recording recognition module provides the detailed information and preview function for browser-side files, along with the recording recognition interface to Alibaba Cloud, realizing the speech-to-text function; this module was designed with performance in mind, redesigning the originally synchronous audio conversion as asynchronous, which greatly improves recognition speed and largely avoids blocking at the program level. The recording analysis module recognizes, judges, parses and recombines the results returned by Alibaba Cloud; it implements the main functions of extracting telephone numbers, counting their occurrences, checking a sentence's start, end and duration, and storing the data in the database. The emotion analysis module takes the recombined sentences from the database, uploads them to the Aliyun emotion analysis interface, parses and processes the returned results, and generates the heat map from them.
The E-R diagram of this embodiment is shown in Fig. 3. An ordinary user authenticates with a user name, id, mailbox and password. After logging in, the user uploads a recording file, whose information comprises the recording file's id, name, size and content, together with the file's upload time and upload type. After upload, the server analyzes the recording file; the analysis content comprises the recording file's id, the file name, each sentence's duration, start time, end time and content, the telephone numbers in the content and their occurrence counts, and the emotion analysis result and status. The user retrieves the recording file's analysis results in the browser interface. A hedged schema sketch follows.
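As a hedged sketch only, the E-R description above might map to MySQL tables like the following, created over JDBC; all table and column names and types are assumptions for illustration, and a MySQL driver is assumed on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class SchemaSetup {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/recordings", "user", "password"); // hypothetical DSN
                 Statement st = conn.createStatement()) {
                st.executeUpdate("CREATE TABLE IF NOT EXISTS app_user (" +
                        "id BIGINT PRIMARY KEY, user_name VARCHAR(64), " +
                        "mailbox VARCHAR(128), password_hash VARCHAR(128))");
                st.executeUpdate("CREATE TABLE IF NOT EXISTS recording (" +
                        "id BIGINT PRIMARY KEY, file_name VARCHAR(255), file_size BIGINT, " +
                        "upload_time DATETIME, upload_type VARCHAR(32))");
                st.executeUpdate("CREATE TABLE IF NOT EXISTS sentence (" +
                        "recording_id BIGINT, begin_ms BIGINT, end_ms BIGINT, " +
                        "duration_ms BIGINT, content TEXT, phone VARCHAR(11), " +
                        "phone_count INT, emotion TINYINT)"); // emotion: -1 / 0 / 1
            }
        }
    }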
This embodiment is built on Alibaba Cloud intelligent speech interaction and natural language processing. It guarantees the accuracy of the recognition results, improves recognition efficiency, makes full use of the recognized results, and optimizes at the program end by executing asynchronously, giving it high availability and high execution efficiency.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise indicated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those of ordinary skill in the art will be able to practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. An Alibaba Cloud-based recording upload recognition and emotion analysis method, characterized by comprising the following steps:
uploading a recording file to be recognized to a server;
the server distinguishing and classifying the recording file to be recognized;
uploading the recording file to be recognized to Alibaba Cloud, where it is recognized and converted to text;
the server obtaining the first JSON strings returned by Alibaba Cloud, each first JSON string corresponding to a word or short phrase;
recombining and parsing the returned first JSON strings and assembling them into sentences;
the server performing time-point segmentation and telephone-number recognition on the assembled sentences;
uploading the sentences to Alibaba Cloud for emotion analysis;
the server obtaining the second JSON string returned by Alibaba Cloud and parsing it;
the server matching the parsed data, sentence by sentence, with the sentences segmented by time point;
the server generating an emotion analysis heat map from the parsed data;
wherein the recombining and parsing of the returned first JSON strings and assembling them into a sentence specifically comprises:
obtaining the taskId in each first JSON string returned by Alibaba Cloud, where the taskId distinguishes one audio file from another;
obtaining the first JSON strings corresponding to each taskId, thereby obtaining the word or phrase corresponding to each first JSON string;
polling the first JSON strings and uploading them to Alibaba Cloud;
obtaining the third JSON string, returned by Alibaba Cloud after polling, that corresponds to each word or short phrase;
extracting the words or phrases from the third JSON strings using FastJSON;
extracting the key information in the third JSON strings, splicing it into sentences, and storing them in a database;
wherein extracting the key information in the third JSON strings, splicing it into a sentence, and storing the sentence in a database specifically comprises the following steps:
extracting the key information in each third JSON string with a regular expression keyed on the Chinese characters and punctuation marks appearing in the returned third JSON strings;
parsing the time information in the returned third JSON strings with a regular expression;
and ordering the extracted segments by their time information and splicing them into a sentence.
2. The Alibaba Cloud-based recording upload recognition and emotion analysis method according to claim 1, wherein the server distinguishing and classifying the recording file to be recognized comprises: distinguishing and classifying the recording file according to its format, size and duration.
3. The Alibaba Cloud-based recording upload recognition and emotion analysis method according to claim 1, wherein uploading the recording file to be recognized to a server comprises:
uploading the recording file to the server as a binary stream over the HTTP protocol.
4. The Alibaba Cloud-based recording upload recognition and emotion analysis method according to claim 1, wherein the server performing time-point segmentation and telephone-number recognition on the assembled sentence specifically comprises:
extracting the start time and end time of the corresponding word or phrase from the first JSON string;
subtracting the start time from the end time to obtain the duration of the word or phrase;
matching the duration to the word or phrase and storing both in a database;
parsing the digits in the text recognition result with a regular expression and marking a digit run as a telephone number when it is 11 digits long;
storing all telephone numbers in the database and counting each number's occurrences with a database view;
and outputting the telephone numbers and their counts to the web page, matched to the corresponding timestamps.
5. The Alibaba Cloud-based recording upload recognition and emotion analysis method according to claim 1, wherein the server generating an emotion analysis heat map from the parsed data specifically comprises the following steps:
obtaining the Alibaba Cloud emotion analysis result, which includes an emotion fluctuation value;
aggregating the emotion analysis results cumulatively;
establishing a rectangular coordinate system with the recording's time nodes on the horizontal axis and the emotion fluctuation value on the vertical axis;
matching each cumulative result to its corresponding time node and plotting a line graph in the coordinate system;
and generating the emotion analysis heat map and outputting it to the web page.
6. The Alibaba Cloud-based recording upload recognition and emotion analysis method according to claim 1, wherein before the recording file to be recognized is uploaded to Alibaba Cloud, the method further comprises the following steps:
extracting the file type, file size and file duration of the recording file to be recognized and storing them in a database;
and matching against the recording files in the database by file type, file size and file duration; if an identical recording file is matched in the database, directly outputting that file's stored analysis result and emotion analysis heat map to the web page.
7. The Alibaba Cloud-based recording upload recognition and emotion analysis method according to any one of claims 1 to 6, further comprising:
analyzing the recording file to be recognized, and performing the emotion analysis, with multiple threads, allocating a thread to each step and executing the method asynchronously;
and adding a synchronization lock between the steps so that a subsequent step does not execute until the preceding step has completed.
8. A recording recognition and emotion analysis system, characterized by comprising a user login and registration module, a file upload module, a recording recognition module, a recording analysis module and an emotion analysis module;
the user login and registration module is used for managing user permissions;
and the file upload module, recording recognition module, recording analysis module and emotion analysis module cooperate to execute the method of any one of claims 1 to 6.
CN202111252398.4A 2021-10-27 2021-10-27 Alibaba Cloud-based recording upload recognition and emotion analysis method and system Active CN114049902B (en)

Priority Applications (1)

Application number: CN202111252398.4A · Priority date / filing date: 2021-10-27 · Title: Alibaba Cloud-based recording upload recognition and emotion analysis method and system

Publications (2)

Publication number: CN114049902A, published 2022-02-15
Publication number: CN114049902B (granted), published 2023-04-07

Family

ID=80206084

Family Applications (1)

Application number: CN202111252398.4A · Filed: 2021-10-27 · Status: Active · Title: Alibaba Cloud-based recording upload recognition and emotion analysis method and system

Country Status (1)

Country Link
CN (1) CN114049902B (en)



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant