CN110351445B - High-concurrency VOIP recording service system based on intelligent voice recognition

Info

Publication number: CN110351445B
Application number: CN201910530307.5A
Authority: CN
Inventors: 袁熹
Original assignee: Chengdu Kangshengsi Technology Co ltd
Current assignee: Chengdu Kangshengsi Technology Co ltd
Priority date: 2019-06-19
Filing date: 2019-06-19
Publication date: 2020-09-29
Anticipated expiration: 2039-06-19
Also published as: CN110351445A

Abstract

The invention relates to a high-concurrency VOIP recording service system based on intelligent voice recognition. The system comprises: a recording service module, which directly decodes and stores the audio of the left and right channels at the device and chip layers to obtain recorded audio; a buffer, which is configured at the software layer and time-synchronized, and whose capacity is increased on demand when audio concurrency exceeds the current capacity; a recording file management module, which places the recorded audio decoded and stored by the decoding storage module into a buffer queue; and a voice recognition engine, which encodes and decodes the recorded audio and, through feature extraction, assembles voice media data packets into correct voice media stream data. The voice media stream data is transmitted to the background service system through MQ message queue middleware. The scheme can process more than 200 sessions simultaneously while also improving recording quality.

Description

High-concurrency VOIP recording service system based on intelligent voice recognition

Technical Field

The invention relates to the field of recording service, in particular to a high-concurrency VOIP recording service system based on intelligent voice recognition.

Background

With the rapid development of IT, the traditional PSTN telephone network can no longer meet communication requirements, especially since the emergence of VOIP. Put simply, VOIP (Voice over Internet Protocol) digitizes the analog voice signal and transmits it in real time over an IP network in the form of data packets. Enterprises are adopting VOIP to gradually replace call center services based on PSTN lines, so as to meet the demand for convenient, unified, and inexpensive communication. However, with the demands and industry upgrades brought by mobile internet technology, traditional recording services struggle to meet customers' urgent needs for high concurrency, fast recognition, and lower-cost operation in which machine learning replaces call center staff.

The prior art has the following disadvantages. In the conventional solution, an AI engine is generally responsible for recognizing and decoding the audio input and for feature extraction before the recognition stage begins. The decoding process itself consumes system resources and is expensive in time. When multiple voice inputs arrive simultaneously, resource consumption becomes very large, which causes poor recording quality and choppy calls. Such systems cannot perform voice recognition in large batches and reach a concurrency of at most 200.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a high-concurrency VOIP recording service system based on intelligent voice recognition, which can not only process more than 200 session processes simultaneously, but also improve the recording quality.

The purpose of the invention is realized by the following technical scheme:

A high-concurrency VOIP recording service system based on intelligent voice recognition, the system comprising:

the recording service module is used for directly decoding and storing the audio of the left and right sound channels at the equipment and chip layers to obtain recording audio;

the buffer register is configured on a software layer and is subjected to time synchronization, and when audio concurrency higher than the current capacity occurs, the capacity of the buffer register is increased as required;

the recording file management module is used for inputting the recording audio decoded and stored by the decoding storage module into a buffer queue;

the voice recognition engine is used for coding and decoding the recorded audio and assembling the voice media data packet into correct voice media stream data through feature extraction;

and the voice media stream data is transmitted to the background service system through the MQ message queue middleware.

Furthermore, the recording service module uses a subprocess cluster module to introduce a multi-process working mode in which each process runs single-threaded. The recording core service is divided into two kinds of process, a main process and working processes. The subprocess cluster is implemented with the cluster mechanism of Node.js, and its main technical points are subprocess state monitoring, a reliable interprocess communication mechanism, and a load scheduling mechanism.

Furthermore, the reliable interprocess communication mechanism uses the cluster's interprocess communication mechanism to transfer data between subprocesses, supplemented by a custom RPC mechanism. The interprocess communication mechanism comprises the following steps:

1) the parent process calls the pipe function to create a pipe and obtains two file descriptors pointing to the two ends of the pipe;

2) the parent process calls fork to create a child process, so the child process also holds two file descriptors pointing to the two ends of the same pipe;

3) the parent process closes its read end and the child process closes its write end, so that the parent process writes data into the pipe and the child process reads it from the pipe.

Further, the custom RPC mechanism defines two operations, namely request/response and event notification.

Request/response:

A request is issued by the RPC caller. The RPC framework layer encodes the call parameters, and the interprocess communication mechanism delivers them to the RPC processor for processing. After the RPC processor finishes, the RPC framework layer encodes the processing result, and the interprocess communication mechanism delivers it back to the RPC caller for subsequent processing.

Event notification:

An event source calls the RPC notification interface to pass event parameters. The RPC framework layer encodes the event parameters, and the interprocess communication mechanism delivers them to the event listener for processing.
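The two RPC operations can be illustrated with a minimal in-process sketch. Everything here is an assumption made for illustration: the `RpcEndpoint` class, its method names, the JSON encoding, and the direct method call that stands in for the interprocess channel are hypothetical and are not taken from the patent.

```python
import json
from typing import Any, Callable, Dict, List, Optional


class RpcEndpoint:
    """Hypothetical RPC framework layer: messages are JSON-encoded and
    handed to the peer endpoint, which stands in for the IPC channel."""

    def __init__(self) -> None:
        self.peer: Optional["RpcEndpoint"] = None
        self.handlers: Dict[str, Callable[[Any], Any]] = {}
        self.listeners: Dict[str, List[Callable[[Any], None]]] = {}

    def connect(self, peer: "RpcEndpoint") -> None:
        self.peer, peer.peer = peer, self

    def register(self, method: str, handler: Callable[[Any], Any]) -> None:
        self.handlers[method] = handler

    def on(self, event: str, listener: Callable[[Any], None]) -> None:
        self.listeners.setdefault(event, []).append(listener)

    def call(self, method: str, params: Any) -> Any:
        # Request/response: encode the call, deliver it, decode the result.
        raw = json.dumps({"type": "request", "method": method, "params": params})
        return json.loads(self.peer._receive(raw))["result"]

    def notify(self, event: str, params: Any) -> None:
        # Event notification: one-way, no result is returned to the source.
        self.peer._receive(json.dumps({"type": "event", "event": event, "params": params}))

    def _receive(self, raw: str) -> Optional[str]:
        message = json.loads(raw)
        if message["type"] == "request":
            result = self.handlers[message["method"]](message["params"])
            return json.dumps({"type": "response", "result": result})
        for listener in self.listeners.get(message["event"], []):
            listener(message["params"])
        return None
```

In a real deployment the `_receive` call would be replaced by a write to the interprocess channel, and responses would be matched to requests asynchronously.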

Furthermore, the load scheduling mechanism realizes the scheduling of the work process by using a minimum load mode, the load calculation adopts a hook mechanism of RPC, the execution condition of the load distributed to the worker by the master is analyzed, and the current load of the worker is calculated according to a calculation strategy;

the load scheduling mechanism of the subprocess cluster calculates the workload of each working process and always places newly added load on the working process with the smallest load. A recording session is established after the recording core service receives a request initiated by a device. Each session consists of two interacting parts, a signaling channel and a media channel: the signaling channel receives the device side's control instructions for the session, and the media channel receives the media stream data transmitted by the device side. The recording session management module distributes signaling sessions and media sessions to different processes.

Furthermore, the recording service module is configured with a media stream management module, after receiving the message that the recording session is successfully established, the device sends the media stream to a media session SOCKET established by the recording service, the media stream management module performs rapid verification of the media packet and extraction of packet structure information, and after the media packet structure information is successfully extracted, the media stream can be delivered to a recording engine to convert the voice media stream on the network into a voice file on a disk;

the media stream management module compresses and packetizes the recording stream on the basis of the G729 protocol. The recording service uses real-time transmission: after the main process receives a media stream, it calls a worker to process it, and the worker processes the stream in packets. The incoming and outgoing directions of a call each occupy one channel for media stream transmission and are packetized separately in the media stream module. When a file is stored, the incoming and outgoing media streams are not merged but are stored directly as raw bytes.

Further, the recording engine uses a jitter buffer to smooth out packet loss and out-of-order arrival of voice data packets. In line with the specification of the recording system, lost packets and silence are replaced with 0 dB sound, and no voice optimization is performed.

Further, the jitter buffer processor keeps the state information of the voice stream currently being recorded and holds one segment of the voice stream in memory. When an RTP packet is received, the processor first extracts the packet's meta-information to determine which part of the voice stream the packet belongs to, so that the packet can be placed at the proper position in the current segment. When the current segment is finished, the jitter buffer processor hands the voice segment to the file storage to be saved on disk and then prepares to process the RTP packets of the next segment.

Besides the jitter buffer, the recording engine also comprises a file storage, which stores the recorded audio obtained by directly decoding and storing the left and right channel audio at the device and chip layers. According to its configuration, the file storage determines the path and name of the recording file for each session and decides when that file is written and closed.

Further, the recording engine divides the storage space for recording files using a 4-level directory. The recording files generated by each VoIP calling device are first stored separately and then organized by the date, hour, and minute at which the recording session began, so that the actual recording files sit in the level-4 minute directory. A quantity limiter is also implemented to keep any single minute-level directory from growing too large, which supports storing and managing tens of millions of recording files per day.

Further, the background service system comprises a device access system, a concurrent buffer system, a data management system, a service system and an application system;

the device access system uses an asynchronous non-blocking IO model and is responsible for interaction with devices and for processing and distributing recording streams. It does not wait for every business process to finish before returning: it concentrates on processing the incoming streaming media and, once that is done, notifies the upper-layer application through a message queue to continue with subsequent business processing, after which the access system is free again;

the concurrent buffer system uses MQ to isolate problems such as overly long call chains, long-running business processing, and high concurrency, and forwards data in a standardized data format;

the service system is responsible for the services abstracted by the system and presents a unified abstraction to the upper layer;

the data management system uses two databases, MySQL and MongoDB, as required, and data is stored separately in each according to business needs.

The invention has the following beneficial effects. Compared with a traditional recording service, the scheme decodes and stores the audio input of the left and right channels directly at the device and chip layers, while the buffer is designed and time-synchronized at the software layer; this step must be completed in cooperation with the hardware. When thousands of concurrent audio streams occur, only the buffer capacity needs to be increased on demand, so voice recognition can be processed more efficiently and in real time.

Drawings

FIG. 1 is a schematic overall flow diagram of the present invention;

FIG. 2 is a schematic diagram of the composition of a platform layer according to the present invention;

FIG. 3 is a flow chart of the RPC mechanism of the present invention;

FIG. 4 is a schematic diagram of the voice recognition service module according to the present invention.

Detailed Description

The technical solution of the present invention is further described in detail with reference to the following specific examples, but the scope of the present invention is not limited to the following.

The main technical challenge addressed by the invention is how to record, store, and perform speech recognition and transcription simultaneously for highly concurrent voice input. In the conventional solution, the AI engine is generally responsible for recognizing and decoding the audio input and for feature extraction before the recognition stage begins; the decoding process itself consumes system resources and is extremely costly in time.

As shown in FIG. 1, to meet the requirements of real-time, high-concurrency scenarios, the system decodes and stores the audio inputs of the left and right channels directly at the device and chip layers, while buffers are designed and time-synchronized at the software layer; this step must be completed in cooperation with the hardware. When thousands of concurrent audio streams occur, only the buffer capacity needs to be increased on demand for voice recognition to be processed more efficiently and in real time. The system consists of a device layer, a network layer, a platform layer, and an application layer, and the improvements of this scheme are mainly embodied in the platform layer.

As shown in fig. 2, the platform layer is mainly composed of three major components, which are a recording service, a voice recognition service and a service management system.

Recording service module

The internal architecture of the recording service module is divided into 5 main modules, including a subprocess cluster, media stream management, recording session management, a recording engine and an encoding and decoding adapter.

The recording core service is a multi-process working mode, and each process is operated in a single thread mode.

The subprocess cluster module introduces the multi-process working mode and divides the recording core service into a main process and working processes. The subprocess cluster is implemented with the cluster mechanism of Node.js, and its main technical points are subprocess state monitoring, a reliable interprocess communication mechanism, and a load scheduling mechanism.

The reliable interprocess communication mechanism uses the cluster's interprocess communication mechanism (that is, the pipe mechanism of the operating system) to transfer data between subprocesses, supplemented by a custom RPC mechanism, which improves the convenience and reliability of the recording service.

Each process has its own user address space, and the global variables of one process are invisible to any other process. Data exchanged between processes must therefore pass through the kernel: a buffer is set up in the kernel, process 1 copies data from user space into the kernel buffer, and process 2 reads the data from the kernel buffer. This kernel-provided mechanism is called interprocess communication (IPC).

A pipe is one of the most basic IPC mechanisms and is created by the pipe function. A pipe works between related processes and is passed on through fork. When the pipe function is called, a buffer (the pipe) is set up in the kernel for communication. It has a read end and a write end, and two file descriptors are returned to the user program through the files parameter: files[0] refers to the read end of the pipe and files[1] to the write end (easy to remember, since 0 is standard input and 1 is standard output). To the user program the pipe looks like an open file: data is read with read(files[0]) and written with write(files[1]), and these operations actually read from and write to the kernel buffer. The pipe function returns 0 on success and -1 on failure.

Process communication using the pipe mechanism comprises the following steps:

1) The parent process calls the pipe function to create a pipe and obtains two file descriptors pointing to the two ends of the pipe.

2) The parent process calls fork to create a child process, so the child process holds the same two file descriptors, pointing to the two ends of the same pipe.

3) The parent process closes its read end and the child process closes its write end, so that the parent process writes data into the pipe and the child process reads it from the pipe.
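The three pipe steps can be sketched with Python's `os.pipe` and `os.fork` on a POSIX system. The function name `send_through_pipe` and the second "reply" pipe are additions for this illustration only; the reply pipe exists so that the parent can confirm what the child read, and it is not one of the three steps.

```python
import os


def send_through_pipe(message: bytes) -> bytes:
    """Parent writes `message` into a pipe; the forked child reads it and
    echoes it back on a second pipe so the parent can confirm delivery."""
    # Step 1: the parent creates the pipe and obtains both descriptors.
    read_fd, write_fd = os.pipe()
    # Extra reply pipe (an addition for this demo, not one of the three steps).
    reply_read_fd, reply_write_fd = os.pipe()

    # Step 2: fork; the child inherits descriptors for the same pipes.
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            # Step 3 (child side): close the write end, then read.
            os.close(write_fd)
            os.close(reply_read_fd)
            received = b""
            while True:
                chunk = os.read(read_fd, 4096)
                if not chunk:  # EOF: the parent closed its write end
                    break
                received += chunk
            os.write(reply_write_fd, received)
            status = 0
        finally:
            os._exit(status)  # leave the child without running parent cleanup

    # Step 3 (parent side): close the read end, then write.
    os.close(read_fd)
    os.close(reply_write_fd)
    os.write(write_fd, message)
    os.close(write_fd)  # signals EOF to the child

    reply = b""
    while True:
        chunk = os.read(reply_read_fd, 4096)
        if not chunk:
            break
        reply += chunk
    os.close(reply_read_fd)
    os.waitpid(pid, 0)
    return reply
```

Closing the unused ends matters: the child only sees end-of-file on the pipe once every copy of the write descriptor, including its own inherited copy, has been closed.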

As shown in FIG. 3, the recording service system further improves convenience and reliability by defining its own set of RPC protocols, which provide request/response and event mechanisms.

The "event dispatcher", "communication endpoint A", and "communication endpoint B" in FIG. 3 belong to the communication framework layer of the RPC. The RPC protocol of the recording service defines two operations: request/response and event notification.

Request/response:

A request is issued by the RPC caller. The RPC framework layer encodes the call parameters and hands them to the interprocess communication mechanism, which delivers them to the RPC processor for processing. After the RPC processor finishes, the RPC framework layer encodes the processing result, and the interprocess communication mechanism delivers it back to the RPC caller for subsequent processing.

Event notification:

An event source calls the RPC notification interface to pass event parameters. The RPC framework layer encodes the event parameters, and the interprocess communication mechanism delivers them to the event listener for processing.

Subprocess state monitoring is implemented by listening for the process exit event of the cluster module; a subprocess is restarted immediately after its exit event is received.

The load scheduling mechanism of the subprocess cluster schedules working processes using a least-load policy, and the load calculation relies on the hook mechanism of the RPC. By analyzing how the load that the master distributes to each worker (working process) is being executed, the current workload of the worker is calculated according to a calculation strategy.

The load scheduling mechanism calculates the workload of each working process and always places newly added load on the working process with the smallest load. A recording session is established by the recording core service after it receives a request initiated by a device, and each session consists of two interacting parts, a signaling channel and a media channel. The signaling channel receives the device side's control instructions for the session, and the media channel receives the media stream data transmitted by the device side. The recording session management module distributes signaling sessions and media sessions to different processes, which satisfies both main requirements: the centralized control needed by signaling sessions and the high throughput needed by media sessions.
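A least-load policy of this kind can be sketched as follows. The class name `LeastLoadScheduler` and the use of a simple per-worker counter as the load measure are assumptions for illustration; the text states only that the load is derived from RPC hooks according to a calculation strategy and does not specify the strategy.

```python
from typing import Dict, Iterable


class LeastLoadScheduler:
    """Hypothetical master-side scheduler: new work always goes to the
    worker whose current load is smallest."""

    def __init__(self, worker_ids: Iterable[str]) -> None:
        # Load is modelled as a plain counter per worker (an assumption).
        self.load: Dict[str, int] = {worker: 0 for worker in worker_ids}

    def assign(self, cost: int = 1) -> str:
        # Ties are broken by insertion order of the workers.
        worker = min(self.load, key=self.load.get)
        self.load[worker] += cost
        return worker

    def release(self, worker: str, cost: int = 1) -> None:
        # Called when a session ends, e.g. from an RPC completion hook.
        self.load[worker] = max(0, self.load[worker] - cost)
```

For example, with three idle workers the first three sessions land on three different workers, and a worker that finishes a session becomes the preferred target for the next one.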

The main process of the recording service tells a subprocess to create a media session through the RPC mechanism. After the media session is created successfully, the main process also subscribes to the media session's end event through the RPC mechanism in order to monitor the session's working state.

After receiving the message that the recording session is successfully established, the device end sends the media stream to a media session SOCKET established by the recording server end, the media stream management module performs quick verification of the media packet and extraction of the packet structure information, and after the media packet structure information is successfully extracted, the media stream can be sent to a recording engine to convert the voice media stream on the network into a voice file on a disk.

The media stream management module compresses and packetizes the recording stream based on protocols such as G729 (a standard audio protocol). The recording service uses real-time transmission: after the main process receives a media stream, it calls a worker to process it. When processing the media stream, the worker abandons the traditional approach (turning the media stream directly into a voice file) and uses packet-based processing designed in-house. The incoming and outgoing directions of a call each occupy one channel for media stream transmission and are packetized separately in the media stream module. When a file is stored, the incoming and outgoing media streams are not merged but are stored directly as raw bytes. Storing the incoming and outgoing bytes independently greatly improves the processing efficiency and the concurrency of the recording service. When a recording is played back, the system merges the two streams internally and then outputs them for playback.
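The playback-time merge can be pictured as interleaving the two separately stored streams into one stereo stream. This is only a sketch under stated assumptions: it treats both streams as already-decoded 16-bit mono PCM, maps the incoming direction to the left channel and the outgoing direction to the right, and pads the shorter stream with zero samples. None of these details, nor the function name `merge_for_playback`, comes from the patent.

```python
def merge_for_playback(incoming: bytes, outgoing: bytes, sample_width: int = 2) -> bytes:
    """Interleave two mono PCM byte streams into one stereo stream
    (incoming -> left channel, outgoing -> right channel)."""
    length = max(len(incoming), len(outgoing))
    length += -length % sample_width  # round up to a whole sample
    left = incoming.ljust(length, b"\x00")   # pad the shorter side with zeros
    right = outgoing.ljust(length, b"\x00")
    merged = bytearray()
    for i in range(0, length, sample_width):
        merged += left[i:i + sample_width]
        merged += right[i:i + sample_width]
    return bytes(merged)
```

Because merging is deferred to playback, the write path only has to append bytes to two independent files, which is consistent with the efficiency argument made above.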

A recording processing path is installed in media stream management. Received voice media packets are passed along it layer by layer; each layer verifies and checks the packets within its own scope and filters out illegal packets, so that only valid media stream RTP packets remain at the end.
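One layer of such a path, the quick check of an RTP packet and the extraction of its header fields, might look like the sketch below. The field layout follows the fixed 12-byte RTP header (version, padding, extension, CSRC count, marker, payload type, sequence number, timestamp, SSRC). The names `RtpPacket` and `parse_rtp` are hypothetical, and which checks the actual system performs is not stated in the text.

```python
import struct
from typing import NamedTuple, Optional


class RtpPacket(NamedTuple):
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int
    payload: bytes


def parse_rtp(packet: bytes) -> Optional[RtpPacket]:
    """Return the parsed packet, or None if it fails basic validation."""
    if len(packet) < 12:
        return None
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    if b0 >> 6 != 2:  # RTP version must be 2
        return None
    offset = 12 + 4 * (b0 & 0x0F)  # skip the CSRC list
    if b0 & 0x10:  # header extension present
        if len(packet) < offset + 4:
            return None
        ext_words = struct.unpack("!H", packet[offset + 2:offset + 4])[0]
        offset += 4 + 4 * ext_words
    if len(packet) < offset:
        return None
    payload = packet[offset:]
    if b0 & 0x20:  # padding: the last byte holds the padding length
        if not payload or payload[-1] == 0 or payload[-1] > len(payload):
            return None
        payload = payload[:-payload[-1]]
    return RtpPacket(b1 & 0x7F, seq, ts, ssrc, payload)
```

Packets for which `parse_rtp` returns `None` would be dropped at this layer; the static RTP payload type for G.729 is 18, so a later layer could additionally filter on `payload_type`.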

Media data transmitted over a network typically suffers from packet loss, reordering, and similar problems. To save bandwidth, a silence packet is sometimes sent in place of a stretch of very low-level voice data. After receiving media data packets, the recording engine module must detect and repair these conditions before the packets can be assembled into correct voice media stream data. To keep slow disk IO from blocking the process during storage, the asynchronous file IO mechanism provided by the operating system is used.

The recording engine uses a jitter buffer to smooth out packet loss and out-of-order arrival of voice data packets. In line with the specification of the recording system, lost packets and silence are replaced with 0 dB sound, and no voice optimization is performed.

The jitter buffer processor keeps the state information of the voice stream currently being recorded and holds one segment of the voice stream in memory. When an RTP packet is received, its meta-information is extracted first to determine which part of the voice stream the packet belongs to, so that it can be placed at the proper position in the current segment. When the current segment is finished, the jitter buffer processor hands the voice segment to the file storage to be saved on disk and then prepares to process the RTP packets of the next segment.
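The segment-based behaviour described above can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the class name `JitterBuffer`, the fixed number of packets per segment, the fixed frame size, the use of the RTP sequence number as the positioning key, and zero-valued bytes as the filler for missing packets are all choices made for this example.

```python
from typing import Callable, List, Optional


class JitterBuffer:
    """Holds one segment in memory, places packets by sequence number,
    and hands each finished segment to a sink (e.g. the file storage)."""

    def __init__(self, sink: Callable[[bytes], None],
                 segment_packets: int = 4, frame_bytes: int = 2) -> None:
        self.sink = sink
        self.segment_packets = segment_packets
        self.frame_bytes = frame_bytes
        self.base_seq: Optional[int] = None  # sequence number of slot 0
        self.slots: List[Optional[bytes]] = [None] * segment_packets

    def push(self, seq: int, payload: bytes) -> None:
        if self.base_seq is None:
            self.base_seq = seq
        offset = (seq - self.base_seq) & 0xFFFF  # tolerate 16-bit wrap-around
        if offset >= 0x8000:
            return  # packet belongs to a segment that was already flushed
        while offset >= self.segment_packets:
            self.flush()  # current segment is finished, move to the next
            offset -= self.segment_packets
        self.slots[offset] = payload  # out-of-order packets land in place

    def flush(self) -> None:
        if self.base_seq is None:
            return
        filler = bytes(self.frame_bytes)  # zero bytes stand in for lost packets
        self.sink(b"".join(s if s is not None else filler for s in self.slots))
        self.base_seq = (self.base_seq + self.segment_packets) & 0xFFFF
        self.slots = [None] * self.segment_packets
```

A caller would invoke `flush()` once more when the session ends so that the last, partially filled segment is also written out.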

In addition to the jitter buffer, the other important component of the recording engine is the file storage.

According to its configuration, the file storage determines the path and name of the recording file for each session and decides when that file is written and closed.

The recording engine divides the storage space for recording files using a 4-level directory. The recording files generated by each VoIP calling device are stored separately (the device's recording directory is named after the device serial number) and then organized by the date, hour, and minute at which the recording session began, so that the actual recording files sit in the level-4 minute directory. A quantity limiter is also implemented to keep any single minute-level directory from growing too large. In this way, the storage and management of tens of millions of recording files is supported.
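The directory layout can be made concrete with a short sketch. The four levels (device serial number, date, hour, minute) come from the text; the exact name formats, the root directory, the file name, and the way the limiter reacts when a minute directory is full are assumptions, since the text does not specify them.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import Dict


def recording_path(root: str, device_serial: str,
                   started: datetime, file_name: str) -> PurePosixPath:
    """Build <root>/<device>/<date>/<hour>/<minute>/<file>."""
    return PurePosixPath(
        root,
        device_serial,               # level 1: one directory per device
        started.strftime("%Y%m%d"),  # level 2: date the session began
        started.strftime("%H"),      # level 3: hour
        started.strftime("%M"),      # level 4: minute
        file_name,
    )


class MinuteDirLimiter:
    """Hypothetical quantity limiter: counts files per minute directory
    and refuses new ones once the configured maximum is reached."""

    def __init__(self, max_files: int) -> None:
        self.max_files = max_files
        self.counts: Dict[PurePosixPath, int] = {}

    def admit(self, directory: PurePosixPath) -> bool:
        count = self.counts.get(directory, 0)
        if count >= self.max_files:
            return False
        self.counts[directory] = count + 1
        return True
```

For a session that began at 14:05 on 2019-06-19 on a device with the made-up serial number `SN001`, this yields a path of the form `/rec/SN001/20190619/14/05/<file>`.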

The specific working process is as follows: after receiving a recording request, recording session management uses the recording engine to decode the audio of the left and right channels directly at the device and chip layers and store it in the file storage.

Meanwhile, the buffer is configured at the software layer and time-synchronized, and when audio concurrency exceeds the current capacity, the capacity of the buffer is increased on demand.
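The on-demand growth of the buffer can be pictured with a small sketch. The text does not describe the buffer's data structure, its growth step, or how time synchronization is achieved, so the class `ElasticBufferPool`, its parameters, and the per-session `bytearray` are all hypothetical; the sketch shows only the capacity rule.

```python
from typing import Dict


class ElasticBufferPool:
    """Hypothetical per-session buffer pool whose capacity is raised
    whenever more sessions arrive than it can currently hold."""

    def __init__(self, capacity: int, grow_step: int) -> None:
        self.capacity = capacity
        self.grow_step = grow_step
        self.in_use: Dict[str, bytearray] = {}

    def acquire(self, session_id: str) -> bytearray:
        if len(self.in_use) >= self.capacity:
            # Concurrency exceeds current capacity: grow instead of rejecting.
            self.capacity += self.grow_step
        buffer = bytearray()
        self.in_use[session_id] = buffer
        return buffer

    def release(self, session_id: str) -> None:
        self.in_use.pop(session_id, None)
```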

The voice recognition engine is an artificial-intelligence-based voice recognition module. The voice recognition service module, shown in FIG. 4, encapsulates and schedules AI capabilities: it can integrate and invoke third-party AI services and can also use the company's own AI capabilities, so that the scenario requirements of the recording service are fully met.

The module is ultimately required to support Chinese and English speech recognition:

for Chinese, only Mandarin is supported; accepted voice formats include PCM, WAV, and the like, and Mandarin recognition supports sampling rates of 8 kHz and 16 kHz at a sampling depth of 16 bits;

voice recognition includes both real-time voice recognition and file-based voice recognition;

voice recognition outputs text, and the recognition accuracy for Mandarin reaches 95%.
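The stated input constraints for WAV files (8 kHz or 16 kHz sampling rate, 16-bit samples) can be checked before audio is handed to a recognizer. The sketch below uses Python's standard `wave` module; the function names are hypothetical, and `make_wav` exists only to build sample data for the check.

```python
import io
import wave


def is_supported_wav(data: bytes) -> bool:
    """True if `data` is a WAV file at 8 kHz or 16 kHz with 16-bit samples."""
    try:
        with wave.open(io.BytesIO(data), "rb") as wav:
            return wav.getframerate() in (8000, 16000) and wav.getsampwidth() == 2
    except (wave.Error, EOFError):
        return False  # not a readable WAV file


def make_wav(rate: int, sample_width: int, frames: bytes = b"\x00\x00" * 8) -> bytes:
    """Helper for the example: build a small mono WAV file in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(frames)
    return buffer.getvalue()
```

Raw PCM input carries no header, so for that format the sampling rate and depth would have to be supplied by the caller rather than detected.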

Background business system

The background service system meets genuinely high concurrency requirements mainly by optimizing the message middleware and the underlying database.

Congestion caused by high-concurrency scenarios is resolved asynchronously;

older protocols remain compatible, so all currently integrated gateway devices can be moved smoothly onto the new recording service;

extensibility is good, so that protocol parsing for other vendors' devices and adaptation to general Internet of Things protocols can be supported later;

in the system architecture, the pressure on the system is expected to come mainly from the access of a large number of devices, so the access system uses an asynchronous non-blocking IO model to improve thread execution efficiency and increase the system's concurrent processing capacity. To keep the system from being overwhelmed by high concurrency and to further increase the processing capacity of the access system, the system as a whole is divided into layers:

System	Technology stack	Description
Device access system	Netty+ZooKeeper	Distributed communication architecture based on asynchronous non-blocking IO; supports dynamic capacity expansion
Concurrent buffer system	RabbitMQ	Message queue cluster; reduces the risk of being overwhelmed by high concurrency
Data management	MySQL+MongoDB+Redis	Classified storage of different kinds of data; fewer concurrent database connections; improved data security
Business system	Spring Boot+MyBatis-Plus+Shiro+Druid	SaaS-based multi-tenant system; unified device management design; standardized API design
Application system	Node.js+React	Standardized front-end applications; independently deployable

The device access system is responsible for interaction with devices and for processing and distributing recording streams. It does not wait for every business process to finish before returning: it concentrates on processing the incoming streaming media and, once that is done, notifies the upper-layer application through a message queue to continue with subsequent business processing, after which the access system is free again;

the concurrent buffer system uses MQ to isolate problems such as overly long call chains, long-running business processing, and high concurrency, and forwards data in a standardized data format;

the service system is concerned only with the services abstracted by the system and presents a unified abstraction to the upper layer.
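The division of labour between the access layer and the upper layers can be sketched with Python's `asyncio`. This is an analogy under stated assumptions rather than the described stack: an in-process `asyncio.Queue` stands in for the RabbitMQ cluster, lists of byte chunks stand in for device sockets, and all function names are hypothetical.

```python
import asyncio
from typing import Dict, List


async def handle_device_stream(device_id: str, chunks: List[bytes],
                               queue: asyncio.Queue) -> None:
    """Access layer: process the stream, publish a message, and return
    without waiting for any business processing."""
    total = 0
    for chunk in chunks:
        total += len(chunk)
        await asyncio.sleep(0)  # yield control, as a non-blocking read would
    await queue.put({"device": device_id, "bytes": total})


async def business_consumer(queue: asyncio.Queue, results: List[Dict],
                            expected: int) -> None:
    """Upper layer: picks up notifications at its own pace."""
    for _ in range(expected):
        results.append(await queue.get())


async def run_demo() -> List[Dict]:
    queue: asyncio.Queue = asyncio.Queue()
    results: List[Dict] = []
    consumer = asyncio.create_task(business_consumer(queue, results, 2))
    await asyncio.gather(
        handle_device_stream("dev-a", [b"ab", b"cd"], queue),
        handle_device_stream("dev-b", [b"xyz"], queue),
    )
    await consumer
    return sorted(results, key=lambda message: message["device"])
```

Because the handlers only enqueue a message and return, a slow consumer delays business processing but does not hold up device access, which is the isolation the message queue is meant to provide.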

The data management system uses two databases, MySQL and MongoDB, as required, and data is stored separately in each according to business needs. MongoDB is used where the system needs fast responses and large-volume storage, which makes full use of its strengths (easy to get started with, high capacity, and fast response) and improves the concurrent processing capacity of the system.

The foregoing describes the preferred embodiments of the invention. It is to be understood that the invention is not limited to the precise form disclosed herein, and that various other combinations, modifications, and environments may be used within the scope of the concept disclosed herein, whether as described above or as apparent to those skilled in the relevant art. Modifications and variations made by those skilled in the art without departing from the spirit and scope of the invention fall within the scope of the appended claims.

Claims

1. A high-concurrency VOIP recording service system based on intelligent voice recognition, characterized in that the system comprises:

the voice media stream data is transmitted to the background service system through the MQ message queue middleware;

the recording service module uses a subprocess cluster module to introduce a multi-process working mode in which each process runs single-threaded; the recording core service is divided into two kinds of process, a main process and working processes; the subprocess cluster is implemented with the cluster mechanism of Node.js, and its technical points are subprocess state monitoring, a reliable interprocess communication mechanism, and a load scheduling mechanism;

the reliable interprocess communication mechanism uses the cluster's interprocess communication mechanism to transfer data between subprocesses, supplemented by a custom RPC mechanism, and the interprocess communication mechanism comprises the following steps:

2. The intelligent voice recognition based highly concurrent VOIP recording service system according to claim 1, wherein the custom RPC mechanism defines two operations, request/response and event notification;

request response:

the RPC framework layer encodes a processing result and transmits the processing result to the RPC caller for subsequent processing;

event notification:

3. The high-concurrency VOIP recording service system based on intelligent voice recognition as claimed in claim 2, wherein the load scheduling mechanism realizes scheduling of a work process by using a minimum load mode, a hook mechanism of RPC is adopted for load calculation, and the current load of a worker is calculated according to a calculation strategy by analyzing the execution condition of distributing the load to the worker by a master;

4. The high-concurrency VOIP recording service system based on intelligent voice recognition as claimed in claim 3, wherein the recording service module is configured with a media stream management module, after receiving the message that the recording session is successfully established, the device sends the media stream to the media session SOCKET established by the recording service, the media stream management module performs fast verification of the media packet and extraction of packet structure information, and after successfully extracting the media packet structure information, the media stream can be delivered to a recording engine to convert the voice media stream on the network into a voice file on a disk;

5. The system of claim 4, wherein the recording engine uses a jitter buffer to smooth out packet loss and out-of-order arrival of voice data packets and, in line with the specification of the recording system, replaces lost packets and silence with 0 dB sound without performing voice optimization.

6. The system of claim 5, wherein the jitter buffer processor keeps the state information of the voice stream currently being recorded and holds one segment of the voice stream in memory; when an RTP packet is received, the processor first extracts the packet's meta-information to determine which part of the voice stream the packet belongs to, so that the packet can be placed at the proper position in the current segment; when the current segment is finished, the jitter buffer processor hands the voice segment to the file storage to be saved on disk and then prepares to process the RTP packets of the next segment;

7. The intelligent voice recognition-based highly concurrent VOIP recording service system according to claim 6, wherein the recording engine divides the storage space for recording files using a 4-level directory; the recording files generated by each VoIP calling device are first stored separately and then organized by the date, hour, and minute at which the recording session began, so that the actual recording files sit in the level-4 minute directory; a quantity limiter is also implemented to keep any single minute-level directory from growing too large, which supports storing and managing tens of millions of recording files per day.

8. The intelligent voice recognition-based highly concurrent VOIP recording service system according to claim 7, wherein the background service system includes a device access system, a concurrent buffering system, a data management system, a service system, an application system;

the concurrent buffer system uses MQ to isolate the problems of overly long call chains, long-running business processing, and high concurrency, and forwards data in a standardized data format;