GB2616512A - Cloud service platform system for speakers - Google Patents

Cloud service platform system for speakers

Info

Publication number
GB2616512A
GB2616512A
Authority
GB
United Kingdom
Prior art keywords
speakers
retrieval matching
value
speech feature
target image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB2300697.6A
Other versions
GB202300697D0 (en)
Inventor
Zhang Xuejun
Li Bin
Zeng Hongjie
Xu Xianfu
Zhang Susu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University
Original Assignee
Guangxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University filed Critical Guangxi University
Publication of GB202300697D0 publication Critical patent/GB202300697D0/en
Publication of GB2616512A publication Critical patent/GB2616512A/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63 Querying
    • G06F16/635 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/165 Management of the audio stream, e.g. setting of volume, audio stream path
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/34 Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60 Network streaming of media packets
    • H04L65/61 Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/52 Network services specially adapted for the location of the user terminal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/08 Feature extraction
    • G06F2218/10 Feature extraction by analysing the shape of a waveform, e.g. extracting parameters relating to peaks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A cloud service platform system for smart speakers, each speaker comprising a speech input module, a network connection unit and a player, receives positioning data from working speakers, marks two speakers within a threshold distance of each other as a “suspected same group” and sends a “suspected same group acknowledgment message” to the corresponding speakers. If the feedback result is “yes”, data for the same group of speakers is unified and transmitted to any speaker within the group, and any speaker may relay data within the group. If the result is “no”, play data is transmitted successively according to a weight priority and the volume is controlled. This allows a single platform to manage conflicts between nearby speakers.

Description

CLOUD SERVICE PLATFORM SYSTEM FOR SPEAKERS
TECHNICAL FIELD
[0001] The present invention relates to the field of cloud services for speakers, and more particularly to a cloud service platform system for speakers.
BACKGROUND
[0002] A smart speaker is an upgraded version of a conventional speaker, through which a household consumer can surf the Internet via speech, e.g., requesting a song, shopping online, or checking a weather forecast. A smart speaker may also be used to control smart home devices, such as opening curtains, setting a refrigerator temperature, or warming a water heater in advance. Baidu released its first own-brand smart speaker, the "Xiaodu Smart Speaker", in Beijing on June 11, 2018. Baidu's artificial intelligence (AI) assistant "Xiaodu Smart Speaker Donkey Kong" was released on the Xiaodu Store on June 1, 2019. The Huawei Sound X smart speaker, developed jointly by Huawei and Devialet, was released officially on November 25 of the same year.
[0003] Existing cloud service platform systems have the technical problems of supporting only a single application and of failing to effectively regulate the relationship between speakers. The present invention solves these problems by providing a cloud service platform system for speakers.
SUMMARY
[0004] The technical problems to be solved by the present invention are those of a single application and of failing to effectively regulate the relationship between speakers in the prior art. There is provided a novel cloud service platform system for speakers, which has the characteristics of serving multiple purposes and of effectively handling conflicts and relationships between speakers.
[0005] To solve the above-mentioned technical problems, the following technical solutions are adopted.
[0006] A cloud service platform system for speakers is provided, where the speakers each include a speech input module, a network connection unit and a player; and the cloud service platform system for the speakers includes a cloud server, and a network connection unit for linking the cloud server and the speakers. The speakers are each provided with a positioning detection unit, and the cloud server receives data from the positioning detection unit in real time; and the cloud server performs the following steps of detecting a conflict between the speakers:
[0007] step 1: receiving positioning data from speakers in a working state;
[0008] step 2: determining states of the speakers in operation based on the positioning data, and, if the positioning distance between two speakers is smaller than a predefined threshold, marking the corresponding speakers as "a suspected same group" and sending a "suspected same group acknowledgment message" to the corresponding speakers; and
[0009] step 3: receiving a feedback result for the "suspected same group acknowledgment message"; if the result is "yes", unifying data of the same group of speakers, transmitting the unified data to any speaker in the same group and controlling any speaker to transmit data within the same group; and if the result is "no", transmitting play data successively according to a weight priority and controlling the volume.
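A minimal sketch of this conflict-detection flow follows. The Speaker structure, the 5 m threshold value, and the ask_same_group callback are assumptions for illustration; the patent specifies neither concrete data structures nor a distance value.

```python
import math
from dataclasses import dataclass

DISTANCE_THRESHOLD_M = 5.0  # hypothetical value; the source only says "predefined threshold"

@dataclass
class Speaker:
    speaker_id: str
    position: tuple      # (x, y) reported by the positioning detection unit
    working: bool = True

def detect_conflicts(speakers, ask_same_group):
    """Steps 1-3 of the conflict-detection flow (simplified sketch).

    ask_same_group(a, b) stands in for sending the "suspected same group
    acknowledgment message" to speakers a and b and returning the
    "yes"/"no" feedback result.
    """
    # step 1: positioning data comes only from speakers in a working state
    working = [s for s in speakers if s.working]

    groups = []                                   # confirmed same groups
    ungrouped = {s.speaker_id for s in working}
    for i, a in enumerate(working):
        for b in working[i + 1:]:
            # step 2: closer than the threshold -> "suspected same group"
            if math.dist(a.position, b.position) < DISTANCE_THRESHOLD_M:
                # step 3: on "yes", unify the group's play data; on "no",
                # the remaining speakers are scheduled by weight priority
                if ask_same_group(a.speaker_id, b.speaker_id) == "yes":
                    groups.append({a.speaker_id, b.speaker_id})
                    ungrouped -= {a.speaker_id, b.speaker_id}
    return groups, ungrouped

# e.g. detect_conflicts([Speaker("s1", (0, 0)), Speaker("s2", (2, 1))],
#                       ask_same_group=lambda a, b: "yes")
```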
[0010] The working principle of the present invention is as follows: the present invention uses the positioning information of the speakers as the basis for determining whether the speakers are in a "suspected same group", and effectively regulates the relationship between speakers that are possibly associated or in conflict according to the feedback result. Meanwhile, the cloud server can control a plurality of Internet applications.
[0011] As an optimization of the above solution, the weight priority is determined by the cloud server through the following steps:
[0012] step 1.1: determining the networking starting time of the speakers, with an earlier time corresponding to a higher priority; and
[0013] step 1.2: determining the self-check states of the speakers, with a better self-check state corresponding to a higher priority.
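One plausible reading of this priority rule is a composite sort key, sketched below. How the two criteria are combined is not stated in the source; a lexicographic key (networking start time first, then self-check state) is an assumption.

```python
def weight_priority(networking_start_time, self_check_score):
    """Sort key for steps 1.1 and 1.2: an earlier networking start time and
    a better self-check state both rank a speaker higher. Sorting ascending
    on this key puts the highest-priority speaker first."""
    # smaller start time wins; negate the score so a higher score wins
    return (networking_start_time, -self_check_score)

# e.g. sorted(speakers, key=lambda s: weight_priority(s.start_time, s.self_check))
```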
[0014] Further, the cloud server is also capable of invoking other applications on the Internet according to instructions from the speakers.
[0015] Further, the cloud server also receives speech control signals from the speakers to perform speech recognition, including the following steps:
[0016] step I: creating a historical speech feature map library, where a historical speech feature map is created by extracting features from statement speech input in advance or recorded as history and drawing a statement speech feature map including word, phrase and sentence feature maps;
[0017] step II: extracting features from statement speech acquired by the speakers in real time, and drawing a target statement speech feature map; selecting and defining any statement speech feature map in the historical speech feature map library as a reference image, and defining the target statement speech feature map as a target image;
[0018] step III: binarizing the target image $I_T$, and defining that a value of 1 indicates having a speech feature and that a value of 0 indicates having no speech feature; meshing the binarized feature map into a grid map, defining a first point (x1, y1) of the grid map as an origin, defining a retrieval matching stride as L, and performing retrieval from the origin in the x direction; if a point having the value of 1 is retrieved, recording the position and value of the point and numbering the point in order; otherwise, continuing the retrieval matching;
[0019] step IV: updating point (x1, y1+N*L) as the origin and performing step III again until the retrieval matching in the x direction and the y direction is completed, thereby completing initial positioning retrieval matching, where N is an integer, and L is a constant;
[0020] step V: successively extracting points having the value of 1, updating the currently extracted point having the value of 1 as the origin, updating the retrieval matching stride to L/2, performing the retrieval matching successively in the x direction without performing the retrieval matching on points previously subjected to the retrieval matching, automatically halving the retrieval matching stride when the retrieval matching extends beyond the range of the target image, and continuing the retrieval matching until the retrieval matching stride comes to a minimum; defining a new point having the value of 1 appearing during the retrieval matching as a new point needing to be subjected to the retrieval matching in the y direction, and performing step VI; otherwise, performing step VII;
[0021] step VI: performing the retrieval matching successively in the y direction while keeping the retrieval matching stride of L/2 unchanged and without performing the retrieval matching on points previously subjected to the retrieval matching, automatically halving the retrieval matching stride when the retrieval matching extends beyond the range of the target image, and continuing the retrieval matching until the retrieval matching stride comes to a minimum; defining a new point having the value of 1 appearing during the retrieval matching as a new point needing to be subjected to the retrieval matching in the x direction, and performing step V; otherwise, performing step VII;
[0022] step VII: ending the retrieval matching when no new point needs to be subjected to the retrieval matching, and defining the region with the points having the value of 1 obtained by the retrieval matching as an effective target image;
[0023] step VIII: performing a search matching analysis on the effective target image in the historical speech feature map library; and
[0024] step IX: invoking a corresponding strategy according to the recognition result.
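A simplified sketch of the coarse-to-fine retrieval matching in steps III-VII. It collapses the alternating x/y passes into a single recursive refinement around each hit, and the minimum stride of 1 is an assumption; the patent does not state the minimum.

```python
def retrieval_matching(grid, stride, min_stride=1):
    """Coarse-to-fine scan of a binarized feature map (steps III-VII, simplified).

    grid[y][x] is 1 where a speech feature is present. The initial pass scans
    rows y0 + N*stride with the given stride; each 1-point found then seeds a
    finer pass at half the stride around it, skipping points already visited,
    until the stride reaches its minimum. Returns the 1-points found, i.e.
    the region forming the "effective target image".
    """
    h, w = len(grid), len(grid[0])
    visited, hits = set(), []

    def scan(x0, y0, step):
        for y in range(y0, h, step):          # y direction: rows y0 + N*step
            for x in range(x0, w, step):      # x direction within the row
                if (x, y) in visited:
                    continue                   # never re-match a visited point
                visited.add((x, y))
                if grid[y][x] == 1:
                    hits.append((x, y))
                    if step // 2 >= min_stride:
                        # halve the stride and refine around the new point
                        scan(max(0, x - step), max(0, y - step), step // 2)

    scan(0, 0, stride)
    return hits
```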
[0025] Further, step VIII further includes image correction, which includes the following steps:
[0026] step a: defining the effective target image as $I_T$, and selecting and defining any reference image in the historical speech feature map library as $I_C$;
[0027] step b: defining the association relationship between the reference image $I_C$ and the target image $I_T$ obtained by polar coordinate transformation: $I_T(r, \varphi) = I_C(a_z r, \varphi - \varphi_z)$, where $a_z$ is a scale offset parameter, and $\varphi_z$ is a rotation offset parameter;
[0028] step c: calculating, in the radial direction in a polar coordinate system, a projection $K_C(i)$ of the reference image $I_C$: $K_C(i) = \Omega_i \sum_{j=1}^{\bar{n}_i} I_C^P(i, j)$, and a projection $K_T(i)$ of the target image $I_T$: $K_T(i) = \Omega_i \sum_{j=1}^{\bar{n}_i} I_T^P(i, j)$; taking logarithms of $K_C(i)$ and $K_T(i)$ to obtain $LK_C(i)$ and $LK_T(i)$, and using the translational difference between $LK_C(i)$ and $LK_T(i)$ as the scale offset parameter $a_z$, where
$I^P(i, j) = I\bigl(K_{\max} + K_i \sin(2\pi j / n_\varphi),\ K_{\max} + K_i \cos(2\pi j / n_\varphi)\bigr)$, $i = 1, 2, \ldots, n_r$, $j = 1, 2, \ldots, \bar{n}_i$;
$\Omega_i = \bar{n}_i / n_\varphi$; and
$\eta_i^1 = \Omega_i (j - 1) - fl[\Omega_i (j - 1)]$, $\eta_i^2 = 1 - \eta_i^1$;
where $\bar{n}_i$ is the number of samples in the angular direction when $K_i = K_{\max}$; $fl(\cdot)$ represents the maximum integer which is less than or equal to the value within the bracket; the target image has a size of $2K_{\max} \times 2K_{\max}$; $n_r = K_{\max}$, representing the number of samples in the radial direction; and $n_\varphi = 8K_i$, representing the number of samples in the angular direction;
[0029] step d: calculating projections of the reference image $I_C$ and the target image $I_T$ in the radial direction and the angular direction according to the scale offset parameter in step c:
$$\Theta_C(j) = \begin{cases} \sum_{i=1}^{n_r} \bigl[\eta_i^1 I_C^P(i, ce(j/a_z)) + \eta_i^2 I_C^P(i, ce(j/a_z) + 1)\bigr], & a_z > 1 \\ 0, & a_z < 1 \end{cases}$$
and
$$\Theta_T(j) = \begin{cases} \sum_{i=1}^{n_r} \bigl[\eta_i^1 I_T^P(i, ce(j a_z)) + \eta_i^2 I_T^P(i, ce(j a_z) + 1)\bigr], & a_z < 1 \\ 0, & a_z > 1 \end{cases}$$
performing a normalization calculation on $\Theta_C$ and $\Theta_T$ to obtain the translation amount $d$ of the highest point, and calculating the rotation offset parameter $\varphi_z$ according to $\varphi_z = 2\pi d / \bar{n}_\varphi$, where $ce(\cdot)$ represents the minimum integer which is greater than or equal to the value within the bracket;
[0030] step e: putting the rotation offset parameter $\varphi_z$ and the scale offset parameter $a_z$ into step A to correct the target image, and calculating, as the center point of the target image, the position point $P_z^M$ corresponding to the minimum of $E_z$ by $E_z = \sum_{i=1}^{\bar{n}_\varphi} \bigl[\Theta_T^z(i) - \Theta_C(i - d)\bigr]^2$, thereby completing image correction.
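A sketch of one way to recover the scale offset $a_z$ from the radial projections in step c. It assumes that the "translational difference" between the log projections is the lag of a cross-correlation peak on a log-resampled radial axis; the sign convention and interpolation details are assumptions, not the patent's prescription.

```python
import numpy as np

def scale_offset_from_projections(K_C, K_T):
    """Estimate the scale offset a_z from two radial projections (step c).

    Resamples both projections onto a uniform grid in log-radius, so that a
    radial scaling of the image becomes a translation of the projection, then
    takes the lag of the cross-correlation peak as the translational
    difference between LK_C and LK_T.
    """
    n = len(K_C)
    r = np.arange(1, n)                          # radii 1..n-1 (skip r = 0)
    log_r = np.linspace(0.0, np.log(n - 1), n)   # uniform grid in log-radius
    lc = np.interp(np.exp(log_r), r, np.asarray(K_C)[1:])
    lt = np.interp(np.exp(log_r), r, np.asarray(K_T)[1:])
    lc, lt = lc - lc.mean(), lt - lt.mean()      # normalize before correlating

    corr = np.correlate(lt, lc, mode="full")
    shift = int(corr.argmax()) - (n - 1)         # lag of the correlation peak
    step = log_r[1] - log_r[0]                   # log-radius units per sample
    return float(np.exp(shift * step))           # a_z = exp(shift * d(log r))
```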
[0031] Further, the search matching analysis in step VIII further includes the following steps:
[0032] step A: making concentric circles with the center point of the target image $I_T$ as the center to divide the speech feature image into B annular regions, and finally dividing each annular region into K sectors, K and B both being predefined constants;
[0033] step B: calculating, as Code1, a sector speech feature value $V_{sq\theta}$ of each sector $S_{sq}$: $V_{sq\theta} = \frac{1}{n_{sq}} \sum_{S_{sq}} \| F_{sq\theta}(x, y) - \bar{P}_{sq\theta} \|$,
[0034] where $F_{sq\theta}(x, y)$ represents the gray value of each pixel of the sector $S_{sq}$; $\bar{P}_{sq\theta}$ represents the average value of the gray values of the pixels in the sector $S_{sq}$; $n_{sq}$ represents the number of pixels in the annular region $S_{sq}$; $0 \leq sq \leq B \times K - 1$; $\theta = \{0°, (360°/K), 2 \times (360°/K), 3 \times (360°/K), \ldots < 180°\}$;
[0035] step C: rotating the speech feature image by (180°/K), repeating step B, and extracting the sector speech feature value $V_{sq\theta}$ of each sector $S_{sq}$ as Code2;
[0036] step E: rotating Code1 and Code2 by R×(360°/K) (R = 0, 1, 2, ..., K-1) to obtain Code1' and Code2', respectively; and
[0037] step F: inputting Code1 and Code2, and Code1' and Code2' in step E, to the historical speech feature map library for matching.
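A sketch of the sector feature computation in steps A and B. The uniform radius step ($r_{\max}/B$) and the pixel-mask bookkeeping are assumptions; the patent fixes only the constants B and K.

```python
import numpy as np

def sector_feature_values(image, center, B, K):
    """Sector speech feature values V_sq (steps A and B, simplified sketch).

    Divides the feature image into B annular regions around `center`, splits
    each annulus into K sectors, and returns the mean absolute deviation of
    the gray values in every sector:
        V_sq = (1 / n_sq) * sum over the sector of |F(x, y) - mean_sq|.
    """
    h, w = image.shape
    cy, cx = center
    ys, xs = np.mgrid[0:h, 0:w]
    r = np.hypot(ys - cy, xs - cx)                       # radius of each pixel
    theta = np.mod(np.arctan2(ys - cy, xs - cx), 2 * np.pi)

    r_max = r.max()
    ring = np.minimum((r / r_max * B).astype(int), B - 1)             # 0..B-1
    sector = np.minimum((theta / (2 * np.pi) * K).astype(int), K - 1)  # 0..K-1

    values = np.zeros(B * K)
    for sq in range(B * K):                  # sector index sq = ring*K + sector
        px = image[(ring * K + sector) == sq]
        if px.size:
            values[sq] = np.abs(px - px.mean()).mean()   # V_sq
    return values

# e.g. sector_feature_values(np.random.rand(64, 64), (32, 32), B=4, K=8)
```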
[0038] The present invention has the following beneficial effects: the present invention uses the positioning information of the speakers as the basis for determining whether the speakers are in a "suspected same group", and the relationship between speakers that are possibly associated or in conflict is then effectively regulated according to the feedback result. Meanwhile, the cloud server can control a plurality of Internet applications. Feature recognition of speech is converted into global recognition of a feature map, so that higher recognition efficiency can be achieved. The accuracy and efficiency of control can be improved by performing correction and positioning processing on the feature image.
BRIEF DESCRIPTION OF THE DRAWINGS
[0039] The present invention will be further described below in conjunction with the accompanying drawings and an example.
[0040] FIG. 1 is a flowchart of detection of a conflict between the speakers.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0041] To make the objective, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the embodiments. It will be understood that the specific example described herein is merely used to explain, rather than limit, the present invention.
Example 1
[0042] This example provides a cloud service platform system for speakers, where each speaker includes a speech input module, a network connection unit and a player; and the cloud service platform system for speakers includes a cloud server, and a network connection unit for linking the cloud server and the speakers. The speakers are each provided with a positioning detection unit, and the cloud server receives data from the positioning detection unit in real time; and the cloud server performs the following steps of detecting a conflict between the speakers:
[0043] step 1: receiving positioning data from speakers in a working state;
[0044] step 2: determining states of the speakers in operation based on the positioning data, and, if the positioning distance between two speakers is smaller than a predefined threshold, marking the corresponding speakers as "a suspected same group" and sending a "suspected same group acknowledgment message" to the corresponding speakers; and
[0045] step 3: receiving a feedback result for the "suspected same group acknowledgment message"; if the result is "yes", unifying data of the speakers in the same group, transmitting the unified data to any speaker in the same group and controlling any speaker to transmit data within the same group; and if the result is "no", transmitting play data successively according to a weight priority and controlling the volume.
[0046] In this example, the positioning information of the speakers is used as the basis for determining whether the speakers are in a "suspected same group", and the relationship between speakers that are possibly associated or in conflict is then effectively regulated according to the feedback result. Meanwhile, the cloud server can control a plurality of Internet applications.
[0047] Specifically, the weight priority is determined by the cloud server through the following steps:
[0048] step 1.1: determining the networking starting time of the speakers, with an earlier time corresponding to a higher priority; and
[0049] step 1.2: determining the self-check states of the speakers, with a better self-check state corresponding to a higher priority.
[0050] Specifically, the cloud server is also capable of invoking other applications on the Internet according to instructions from the speakers.
[0051] In one embodiment, the cloud server further receives speech control signals from the speakers to perform speech recognition, including the following steps:
[0052] step I: creating a historical speech feature map library, where a historical speech feature map is created by extracting features from statement speech input in advance or recorded as history and drawing a statement speech feature map including word, phrase and sentence feature maps;
[0053] step II: extracting features from statement speech acquired by the speakers in real time, and drawing a target statement speech feature map; selecting and defining any statement speech feature map in the historical speech feature map library as a reference image, and defining the target statement speech feature map as a target image;
[0054] step III: binarizing the target image $I_T$, and defining that a value of 1 indicates having a speech feature and that a value of 0 indicates having no speech feature; meshing the binarized feature map into a grid map, defining a first point (x1, y1) of the grid map as the origin, defining a retrieval matching stride as L, and performing retrieval from the origin along the x direction; if a point having the value of 1 is retrieved, recording the position and value of the point and numbering the point in order; otherwise, continuing the retrieval matching;
[0055] step IV: updating point (x1, y1+N*L) as the origin and performing step III again until the retrieval matching in the x direction and the y direction is completed, thereby completing initial positioning retrieval matching, where N is an integer, and L is a constant;
[0056] step V: successively extracting points having the value of 1, updating the currently extracted point having the value of 1 as the origin, updating the retrieval matching stride to L/2, performing the retrieval matching successively in the x direction without performing the retrieval matching on points previously subjected to the retrieval matching, automatically halving the retrieval matching stride when the retrieval matching extends beyond the range of the target image, and continuing the retrieval matching until the retrieval matching stride comes to a minimum; defining a new point having the value of 1 appearing during the retrieval matching as a new point needing to be subjected to the retrieval matching in the y direction, and performing step VI; otherwise, performing step VII;
[0057] step VI: performing the retrieval matching successively in the y direction while keeping the retrieval matching stride of L/2 unchanged and without performing the retrieval matching on points previously subjected to the retrieval matching, automatically halving the retrieval matching stride when the retrieval matching extends beyond the range of the target image, and continuing the retrieval matching until the retrieval matching stride comes to a minimum; defining a new point having the value of 1 appearing during the retrieval matching as a new point needing to be subjected to the retrieval matching in the x direction, and performing step V; otherwise, performing step VII;
[0058] step VII: ending the retrieval matching when no new point needs to be subjected to the retrieval matching, and defining the region with the points having the value of 1 obtained by the retrieval matching as an effective target image;
[0059] step VIII: performing a search matching analysis on the effective target image in the historical speech feature map library; and
[0060] step IX: invoking a corresponding strategy according to the recognition result.
[0061] In one embodiment, step VIII further includes image correction, which includes the following steps:
[0062] step a: defining the effective target image as $I_T$, and selecting and defining any reference image in the historical speech feature map library as $I_C$;
[0063] step b: defining the association relationship between the reference image $I_C$ and the target image $I_T$ obtained by polar coordinate transformation as follows: $I_T(r, \varphi) = I_C(a_z r, \varphi - \varphi_z)$, where $a_z$ is a scale offset parameter, and $\varphi_z$ is a rotation offset parameter;
[0064] step c: calculating, in the radial direction in a polar coordinate system, a projection $K_C(i)$ of the reference image $I_C$: $K_C(i) = \Omega_i \sum_{j=1}^{\bar{n}_i} I_C^P(i, j)$, and a projection $K_T(i)$ of the target image $I_T$: $K_T(i) = \Omega_i \sum_{j=1}^{\bar{n}_i} I_T^P(i, j)$; taking logarithms of $K_C(i)$ and $K_T(i)$ to obtain $LK_C(i)$ and $LK_T(i)$, and taking the translational difference between $LK_C(i)$ and $LK_T(i)$ as the scale offset parameter $a_z$, where
$I^P(i, j) = I\bigl(K_{\max} + K_i \sin(2\pi j / n_\varphi),\ K_{\max} + K_i \cos(2\pi j / n_\varphi)\bigr)$, $i = 1, 2, \ldots, n_r$, $j = 1, 2, \ldots, \bar{n}_i$;
$\Omega_i = \bar{n}_i / n_\varphi$; and
$\eta_i^1 = \Omega_i (j - 1) - fl[\Omega_i (j - 1)]$, $\eta_i^2 = 1 - \eta_i^1$;
where $\bar{n}_i$ is the number of samples in the angular direction when $K_i = K_{\max}$; $fl(\cdot)$ represents the maximum integer which is less than or equal to the value within the bracket; the target image has a size of $2K_{\max} \times 2K_{\max}$; $n_r = K_{\max}$, representing the number of samples in the radial direction; and $n_\varphi = 8K_i$, representing the number of samples in the angular direction;
[0065] step d: calculating projections of the reference image $I_C$ and the target image $I_T$ in the radial direction and the angular direction according to the scale offset parameter in step c:
$$\Theta_C(j) = \begin{cases} \sum_{i=1}^{n_r} \bigl[\eta_i^1 I_C^P(i, ce(j/a_z)) + \eta_i^2 I_C^P(i, ce(j/a_z) + 1)\bigr], & a_z > 1 \\ 0, & a_z < 1 \end{cases}$$
and
$$\Theta_T(j) = \begin{cases} \sum_{i=1}^{n_r} \bigl[\eta_i^1 I_T^P(i, ce(j a_z)) + \eta_i^2 I_T^P(i, ce(j a_z) + 1)\bigr], & a_z < 1 \\ 0, & a_z > 1 \end{cases}$$
and performing a normalization calculation on $\Theta_C$ and $\Theta_T$ to obtain the translation amount $d$ of the highest point, and calculating the rotation offset parameter $\varphi_z$ according to $\varphi_z = 2\pi d / \bar{n}_\varphi$, where $ce(\cdot)$ represents the minimum integer which is greater than or equal to the value within the bracket;
[0066] step e: putting the rotation offset parameter $\varphi_z$ and the scale offset parameter $a_z$ into step A to correct the target image, and calculating the position point $P_z^M$ corresponding to the minimum of $E_z$ by $E_z = \sum_{i=1}^{\bar{n}_\varphi} \bigl[\Theta_T^z(i) - \Theta_C(i - d)\bigr]^2$ as the center point of the target image, thereby completing image correction.
[0067] In one embodiment, the search matching analysis in step VIII further includes the following steps:
[0068] step A: making concentric circles with the center point of the target image $I_T$ as the center to divide the speech feature image into B annular regions, and finally, dividing each annular region into K sectors, where K and B are both predefined constants;
[0069] step B: calculating a sector speech feature value $V_{sq\theta}$ of each sector $S_{sq}$ as Code1: $V_{sq\theta} = \frac{1}{n_{sq}} \sum_{S_{sq}} \| F_{sq\theta}(x, y) - \bar{P}_{sq\theta} \|$, where $F_{sq\theta}(x, y)$ represents the gray value of each pixel of the sector $S_{sq}$; $\bar{P}_{sq\theta}$ represents the average value of the gray values of the pixels in the sector $S_{sq}$; $n_{sq}$ represents the number of pixels in the annular region $S_{sq}$; $0 \leq sq \leq B \times K - 1$; $\theta = \{0°, (360°/K), 2 \times (360°/K), 3 \times (360°/K), \ldots < 180°\}$;
[0070] step C: rotating the speech feature image by (180°/K), repeating step B, and extracting the sector speech feature value $V_{sq\theta}$ of each sector $S_{sq}$ as Code2;
[0071] step E: rotating Code1 and Code2 by R×(360°/K) (R = 0, 1, 2, ..., K-1) to obtain Code1' and Code2', respectively; and
[0072] step F: inputting Code1 and Code2, and Code1' and Code2' in step E, to the historical speech feature map library for matching.
[0073] In this example, the positioning information of the speakers is used as the basis for determining whether the speakers are in a "suspected same group", and the relationship between speakers that are possibly associated or in conflict is then effectively regulated according to the feedback result. Meanwhile, the cloud server can control a plurality of Internet applications. Feature recognition of speech is converted into global recognition of a feature map, so that higher recognition efficiency can be achieved. The accuracy and efficiency of control can be improved by performing correction and positioning processing on the feature image.
[0074] While an illustrative specific embodiment of the present invention is described above so that those skilled in the art can understand the present invention, the present invention is not limited to the scope of this specific embodiment. For those skilled in the art, all invention-creations using the concept of the present invention shall fall within the protection scope of the present invention, as long as various alterations are made within the spirit and scope of the present invention as defined and determined by the appended claims.

Claims (6)

  1. A cloud service platform system for speakers, the speakers each comprising a speech input module, a network connection unit and a player, the cloud service platform system for speakers comprising a cloud server, and a network connection unit for linking the cloud server and the speakers, wherein the speakers each are provided with a positioning detection unit, and the cloud server receives data from the positioning detection unit in real time; and the cloud server performs the following steps of detecting a conflict between the speakers:
     step 1: receiving positioning data from speakers in a working state;
     step 2: determining states of the speakers in operation based on the positioning data, and if a positioning distance between two speakers is smaller than a predefined threshold, marking corresponding speakers as "a suspected same group" and sending a "suspected same group acknowledgment message" to the corresponding speakers; and
     step 3: receiving a feedback result for the "suspected same group acknowledgment message"; when the result is "yes", unifying data of a same group of speakers, transmitting the unified data to any speaker in the same group of speakers and controlling any speaker to transmit data within the same group; and when the result is "no", transmitting play data successively according to a weight priority and controlling volume.
  2. The cloud service platform system for the speakers according to claim 1, wherein the weight priority is determined by the cloud server through the following steps:
     step 1.1: determining a networking starting time of the speakers, with an earlier time corresponding to a higher priority; and
     step 1.2: determining self-check states of the speakers, with a better self-check state corresponding to a higher priority.
  3. The cloud service platform system for the speakers according to claim 1, wherein the cloud server is also capable of invoking other applications on the Internet according to instructions from the speakers.
  4. The cloud service platform system for the speakers according to any one of claims 1 to 3, wherein the cloud server further receives speech control signals from the speakers to perform speech recognition, comprising:
     step I: creating a historical speech feature map library, wherein a historical speech feature map is created by extracting features from statement speech input in advance or recorded as history and drawing a statement speech feature map comprising word, phrase and sentence feature maps;
     step II: extracting features from statement speech acquired by the speakers in real time, and drawing a target statement speech feature map; selecting and defining any statement speech feature map in the historical speech feature map library as a reference image, and defining the target statement speech feature map as a target image;
     step III: binarizing the target image $I_T$, and defining that a value of 1 indicates having a speech feature and that a value of 0 indicates having no speech feature; meshing the binarized feature map into a grid map, defining a first point (x1, y1) of the grid map as an origin, defining a retrieval matching stride as L, and performing retrieval from the origin in an x direction; when a point having the value of 1 is retrieved, recording a position and the value of the point and numbering the point in order; when a point having the value of 0 is retrieved, continuing retrieval matching;
     step IV: updating point (x1, y1+N*L) as the origin, and performing step III again until the retrieval matching in the x direction and a y direction is completed, thereby completing initial positioning retrieval matching, wherein N is an integer, and L is a constant;
     step V: successively extracting points having the value of 1, updating a currently extracted point having the value of 1 as the origin, updating the retrieval matching stride to L/2, performing the retrieval matching successively in the x direction without performing the retrieval matching on points previously subjected to the retrieval matching, automatically halving the retrieval matching stride when the retrieval matching extends beyond a range of the target image, and continuing the retrieval matching until the retrieval matching stride comes to a minimum; defining a new point having the value of 1 appearing during the retrieval matching as a new point needing to be subjected to the retrieval matching in the y direction, and performing step VI; when no new point having the value of 1 appears, performing step VII;
     step VI: performing the retrieval matching successively in the y direction while keeping the retrieval matching stride of L/2 unchanged and without performing the retrieval matching on points previously subjected to the retrieval matching, automatically halving the retrieval matching stride when the retrieval matching extends beyond the range of the target image, and continuing the retrieval matching until the retrieval matching stride comes to a minimum; defining a new point having the value of 1 appearing during the retrieval matching as a new point needing to be subjected to the retrieval matching in the x direction, and performing step V; when no new point having the value of 1 appears, performing step VII;
     step VII: ending the retrieval matching when no new point needs to be subjected to the retrieval matching, and defining a region with the points having the value of 1 obtained by the retrieval matching as an effective target image;
     step VIII: performing a search matching analysis on the effective target image in the historical speech feature map library; and
     step IX: invoking a corresponding strategy according to a recognition result.
  5. The cloud service platform system for the speakers according to claim 4, wherein step VIII further comprises image correction comprising the following steps:
     step a: defining the effective target image as $I_T$, and selecting and defining any reference image in the historical speech feature map library as $I_C$;
     step b: defining an association relationship between the reference image $I_C$ and the target image $I_T$ obtained by polar coordinate transformation: $I_T(r, \varphi) = I_C(a_z r, \varphi - \varphi_z)$, wherein $a_z$ is a scale offset parameter, and $\varphi_z$ is a rotation offset parameter;
     step c: calculating, in the radial direction in a polar coordinate system, a projection $K_C(i)$ of the reference image $I_C$: $K_C(i) = \Omega_i \sum_{j=1}^{\bar{n}_i} I_C^P(i, j)$, and a projection $K_T(i)$ of the target image $I_T$: $K_T(i) = \Omega_i \sum_{j=1}^{\bar{n}_i} I_T^P(i, j)$; taking logarithms of $K_C(i)$ and $K_T(i)$ to obtain $LK_C(i)$ and $LK_T(i)$, and using a translational difference between $LK_C(i)$ and $LK_T(i)$ as the scale offset parameter $a_z$, wherein $I^P(i, j) = I\bigl(K_{\max} + K_i \sin(2\pi j / n_\varphi),\ K_{\max} + K_i \cos(2\pi j / n_\varphi)\bigr)$, $i = 1, 2, \ldots, n_r$, $j = 1, 2, \ldots, \bar{n}_i$; $\Omega_i = \bar{n}_i / n_\varphi$; and $\eta_i^1 = \Omega_i (j - 1) - fl[\Omega_i (j - 1)]$, $\eta_i^2 = 1 - \eta_i^1$; wherein $\bar{n}_i$ is a number of samples in an angular direction when $K_i = K_{\max}$; $fl(\cdot)$ represents a maximum integer which is less than or equal to the value within the bracket; the target image has a size of $2K_{\max} \times 2K_{\max}$; $n_r = K_{\max}$, representing a number of samples in the radial direction; and $n_\varphi = 8K_i$, representing a number of samples in the angular direction;
     step d: calculating projections of the reference image $I_C$ and the target image $I_T$ in the radial direction and the angular direction according to the scale offset parameter in step c:
     $$\Theta_C(j) = \begin{cases} \sum_{i=1}^{n_r} \bigl[\eta_i^1 I_C^P(i, ce(j/a_z)) + \eta_i^2 I_C^P(i, ce(j/a_z) + 1)\bigr], & a_z > 1 \\ 0, & a_z < 1 \end{cases}$$
     and
     $$\Theta_T(j) = \begin{cases} \sum_{i=1}^{n_r} \bigl[\eta_i^1 I_T^P(i, ce(j a_z)) + \eta_i^2 I_T^P(i, ce(j a_z) + 1)\bigr], & a_z < 1 \\ 0, & a_z > 1 \end{cases}$$
     and performing a normalization calculation on $\Theta_C$ and $\Theta_T$ to obtain a translation amount $d$ of the highest point, and calculating the rotation offset parameter $\varphi_z$ according to $\varphi_z = 2\pi d / \bar{n}_\varphi$, wherein $ce(\cdot)$ represents a minimum integer which is greater than or equal to the value within the bracket;
     step e: putting the rotation offset parameter $\varphi_z$ and the scale offset parameter $a_z$ into step A to correct the target image, and calculating, as a center point of the target image, a position point $P_z^M$ corresponding to a minimum of $E_z$ by $E_z = \sum_{i=1}^{\bar{n}_\varphi} \bigl[\Theta_T^z(i) - \Theta_C(i - d)\bigr]^2$, thereby completing image correction.
  6. The cloud service platform system for the speakers according to claim 4, wherein the search matching analysis in step VIII further comprises the following steps:
     step A: making concentric circles with a center point of the target image $I_T$ as the center to divide the speech feature image into B annular regions, and finally dividing each annular region into K sectors, K and B both being predefined constants;
     step B: calculating, as Code1, a sector speech feature value $V_{sq\theta}$ of each sector $S_{sq}$: $V_{sq\theta} = \frac{1}{n_{sq}} \sum_{S_{sq}} \| F_{sq\theta}(x, y) - \bar{P}_{sq\theta} \|$, wherein $F_{sq\theta}(x, y)$ represents a gray value of each pixel of the sector $S_{sq}$; $\bar{P}_{sq\theta}$ represents an average value of gray values of pixels in the sector $S_{sq}$; $n_{sq}$ represents a number of pixels in the annular region $S_{sq}$; $0 \leq sq \leq B \times K - 1$; $\theta = \{0°, (360°/K), 2 \times (360°/K), 3 \times (360°/K), \ldots < 180°\}$;
     step C: rotating the speech feature image by (180°/K), repeating step B, and extracting a sector speech feature value $V_{sq\theta}$ of each sector $S_{sq}$ as Code2;
     step E: rotating Code1 and Code2 by R×(360°/K) (R = 0, 1, 2, ..., K-1) to obtain Code1' and Code2', respectively; and
     step F: inputting Code1 and Code2, and Code1' and Code2' in step E, to the historical speech feature map library for matching.
GB2300697.6A 2022-01-26 2023-01-17 Cloud service platform system for speakers Pending GB2616512A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210095507.4A CN114550715A (en) 2022-01-26 2022-01-26 Sound box cloud service platform system

Publications (2)

Publication Number Publication Date
GB202300697D0 GB202300697D0 (en) 2023-03-01
GB2616512A true GB2616512A (en) 2023-09-13

Family

ID=81672941

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2300697.6A Pending GB2616512A (en) 2022-01-26 2023-01-17 Cloud service platform system for speakers

Country Status (2)

Country Link
CN (1) CN114550715A (en)
GB (1) GB2616512A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9898250B1 (en) * 2016-02-12 2018-02-20 Amazon Technologies, Inc. Controlling distributed audio outputs to enable voice output
EP3541094A1 (en) * 2018-03-15 2019-09-18 Harman International Industries, Incorporated Smart speakers with cloud equalizer
US20210029452A1 (en) * 2019-07-22 2021-01-28 Apple Inc. Modifying and Transferring Audio Between Devices

Also Published As

Publication number Publication date
GB202300697D0 (en) 2023-03-01
CN114550715A (en) 2022-05-27

Similar Documents

Publication Publication Date Title
Zhang et al. Distributed dynamic map fusion via federated learning for intelligent networked vehicles
KR102611454B1 (en) Storage device for decentralized machine learning and machine learning method thereof
US20150005901A1 (en) Iterative learning for reliable sensor sourcing systems
KR101109379B1 (en) Network fingerprinting
JP2022084814A (en) Distributed data collection method in wireless sensor network in which first node can publish itself or sensor data as collector to another node
JP5844920B2 (en) Image location method and system based on navigation function of mobile terminal
WO2019128355A1 (en) Method and device for determining accurate geographic location
CN111935820B (en) Positioning implementation method based on wireless network and related equipment
US9640074B2 (en) Permissions-based tracking of vehicle positions and arrival times
CN111353106A (en) Recommendation method and device, electronic equipment and storage medium
CN112860811A (en) Method and device for determining data blood relationship, electronic equipment and storage medium
CN113075648A (en) Clustering and filtering method for unmanned cluster target positioning information
GB2616512A (en) Cloud service platform system for speakers
WO2023226448A1 (en) Method and apparatus for generating logistics point-of-interest information, and device and computer-readable medium
WO2022068558A1 (en) Map data transmission method and apparatus
KR20220080051A (en) Method and apparatus for map query and electronic device
CN112380314B (en) Road network information processing method and device, storage medium and electronic equipment
US20190347807A1 (en) Information processing apparatus, data collection method, and data collection system
Faheem et al. Indexing in wot to locate indoor things
KR20210030136A (en) Apparatus and method for generating vehicle data, and vehicle system
Ye et al. Federated Learning-Enabled Cooperative Localization in Multi-agent System
CN114820955B (en) Symmetric plane completion method, device, equipment and storage medium
CN115242704B (en) Network topology data updating method and device and electronic equipment
JP2019128611A (en) Generation apparatus, generation method, and generation program
CN113627561B (en) Data fusion method and device, electronic equipment and storage medium