数据服务管理平台自动生成的SQL的查询优化 ,查询优化包括mysql
同样的计算逻辑,可以有高效的SQL写法,也可以有低效的SQL写法。两者的运行效率差距很大,传统的成熟的数据库,如mysql,他们做了很多sql上的优化,可以将用户编写的效率低的SQL转化成效率高的SQL,进而加快SQL的执行。
特别是在一些利用SQL进行自动化建模查询的场景(如数据服务的数据集编辑,指标查询建模等),往往是先构建查询的模式(既先构建JOIN和子查询等模式),最后再添加过滤条件,如果没有条件自动化条件下推,那对自动化建模是一个极大的挑战。
假设 table_a有字段 date, device_id, a,b, c,d,e
table_b有字段 date, device_id, x,y
当前编辑数据集(JOIN table_a 和table_b)后再生产API后的SQL是这样子的
select count(1) as cnt, date, b from ( select table_left.date as date, table_left.divice_id as divice_id, a, b, c, d, e, table_right.date_right, table_right.divice_id_right, x, y from ( select date, device_id, a, b, c, d, e from table_a ) table_left join ( select date, device_id, x, y from table_b ) table_right USING( table_left.date = table_right.date and table_left.divice_id = table_right.divice_id ) ) where date = '2020-04-01' and a = 10 and x = 20 group by date, b
需要改写上述 SQL 为:
select count(1) as cnt, date, b from ( select table_left.date as date, table_left.divice_id as divice_id, a, b table_right.date_right, table_right.divice_id_right, x from ( select date, device_id, a, b from table_a where date = '2020-04-01' and a = 10 ) table_left join ( select date, device_id, x from table_b where date = '2020-04-01' and x = 20 ) table_right USING( table_left.date = table_right.date and table_left.divice_id = table_right.divice_id ) ) where date = '2020-04-01' and a = 10 and x = 20 group by date, b
条件下推
case1: where 条件中使用了别名或者表达式, 比如 tt是一个表达式。如果不进行条件下推,会先把整个表table1的数据都读出来,再执行过滤条件,会造成大量的磁盘IO。
select p_date as date, floor(col1) -3600 as tt from ( select p_date, col1, col2, col3 from table1 ) as xx where date = '2020-01-01' and tt > col2 ===> select p_date as date, floor(col1) -3600 as tt from ( select p_date, col1, col2, col3 from table1 where p_date = '2020-01-01' and floor(col1) -3600 > col2 ) as xx where date = '2020-01-01' and tt > col2
case2: 子查询中包含聚合,外层查询中的where需要下推成子查询的where和having
select p_date as date, cnt from ( select p_date,count(col1) as cnt from table1 group by p_date ) as xx where date = '2020-01-01' and cnt > 10 ===> select p_date as date, cnt from ( select p_date,count(col1) as cnt from table1 where p_date = '2020-01-01' group by p_date having count(col1) > 10 ) as xx where date = '2020-01-01' and cnt > 10
字段裁剪
当存在子查询或JOIN时,如果最外层查询只引用了内部子查询或JOIN的部分列,则可以去掉子查询或JOIN中未被外层查询引用的列,减少数据读取时间&网络传输的开销。
例如:
select a, b, m,n from ( select a, b, c d, e, f, g, h, i, j, k from dataset1 ) as ds1 left join ( select m, n, o ,p ,q ,r,s,t ) as ds2 on (ds1.a=ds2.a)
在执行时,会将join的左表和右表的所有字段数据都从磁盘中读出,再进行JOIN,这造成大量的磁盘IO。所以我们将子查询中不必要的字段去掉,可以减少大量的IO开销
字段裁剪后的结果:
select a, b, m,n from ( select a, b from dataset1 ) as ds1 left join ( select m, n ) as ds2 on (ds1.a=ds2.a)
distributed_perfect_shard
在join以及count(distinct xxx)的场景,需要数据在不同节点之间传输计算,这在数据量大的情况下,会造成大量的网络传输,影响查询性能。而有的情况下,只需要每个节点单独计算,再简单合并计算结果也可以得到正确的结果。
去除子查询
有一些场景是可以把子查询去掉,把一个包含子查询的查询转换成一个不包含子查询的查询(或者是把一个n层子查询的查询转换成k层的子查询的查询, k < n)
举例
原SQL: select a from ( select b,c, d + 1 as a from xxx ) where a = 10 and b = (select a from (select a from xxx)) ==> 目标SQL: SELECT `d` + 1 AS `a` FROM `xxx` WHERE `a` = 10 AND `b` = (SELECT `a` AS `a` FROM `xxx`)
优化方法包括:谓词下推、列裁剪、子查询合并
下面是一些案例的展示及说明:
谓词下推
两表Join - exp1
优化后SQL将表达式 t1.c > 1 下推到表 t1 的子查询中。
-- 原SQL SELECT t1.a, t1.b, t2.c FROM ( SELECT a, b, c FROM t1 where a > 1 ) t1 JOIN ( SELECT a, b, c FROM t2 where a > 1 ) t2 ON t1.b = t2.b WHERE t1.c > 1 -- 优化后SQL SELECT `t1`.`a`, `t1`.`b`, `t2`.`c` FROM (SELECT `a`, `b`, `c` FROM `t1` WHERE (`a` > 1) AND (`c` > 1)) AS `t1` JOIN (SELECT `a`, `b`, `c` FROM `t2` WHERE `a` > 1) AS `t2` ON `t1`.`b` = `t2`.`b`
两表Join - exp1 and exp2
原SQL的表达式 t1.c > 1 和 t2.c > 1 分别下推到表t1和t2的子查询中。
-- 原SQL SELECT t1.a, t1.b, t2.c FROM ( SELECT a, b, c FROM t1 where a > 1 ) t1 JOIN ( SELECT a, b, c FROM t2 where a > 1 ) t2 ON t1.b = t2.b WHERE t1.c > 1 and t2.c > 1 -- 优化后SQL SELECT `t1`.`a`, `t1`.`b`, `t2`.`c` FROM (SELECT `a`, `b`, `c` FROM `t1` WHERE (`a` > 1) AND (`c` > 1)) AS `t1` JOIN (SELECT `a`, `b`, `c` FROM `t2` WHERE (`a` > 1) AND (`c` > 1)) AS `t2` ON `t1`.`b` = `t2`.`b` WHERE `t1`.`c` > 1
两表Join - exp1 or exp2
表达式 t1.c > 1 和 t2.c > 1 的逻辑运算符是or,且两个表达式属于不同的表,所以不能下推
-- 原SQL SELECT t1.a, t1.b, t2.c FROM ( SELECT a, b, c FROM t1 where a > 1 ) t1 JOIN ( SELECT a, b, c FROM t2 where a > 1) t2 ON t1.b = t2.b WHERE t1.c > 1 or t2.c > 1 -- 优化后SQL SELECT `t1`.`a`, `t1`.`b`, `t2`.`c` FROM (SELECT `a`, `b`, `c` FROM `t1` WHERE `a` > 1) AS `t1` JOIN (SELECT `a`, `b`, `c` FROM `t2` WHERE `a` > 1) AS `t2` ON `t1`.`b` = `t2`.`b` WHERE `t1`.`c` > 1 OR `t2`.`c` > 1
表达式 t1.c > 1 和 t1.c < 10 的逻辑运算符是or,但两个表达式属于同一个表,所以可以看成一个表达式进行下推。
-- 原SQL SELECT t1.a, t1.b, t2.c FROM ( SELECT a, b, c FROM t1 where a > 1 ) t1 JOIN ( SELECT a, b, c FROM t2 where a > 1) t2 ON t1.b = t2.b WHERE t1.c > 1 or t1.c < 10 -- 优化后SQL SELECT `t1`.`a`, `t1`.`b`, `t2`.`c` FROM (SELECT `a`, `b`, `c` FROM `t1` WHERE (`a` > 1) AND (`c` > 1 OR `c` < 10)) AS `t1` JOIN (SELECT `a`, `b`, `c` FROM `t2` WHERE `a` > 1) AS `t2` ON `t1`.`b` = `t2`.`b` WHERE `t1`.`c` > 1 OR `t1`.`c` < 10
两表join - (exp1 or exp2) and exp3
表达式 t1.c > 1 和 t1.c < 10,属于同一个表,可以看成一个表达式一起下推;
所以表达式 t1.c > 1 or t1.c < 10 和 t2.c > 1 都可以下推。
-- 原SQL SELECT t1.a, t1.b, t2.c FROM ( SELECT a, b, c FROM t1 where a > 1 ) t1 JOIN ( SELECT a, b, c FROM t2 where a > 1) t2 ON t1.b = t2.b WHERE (t1.c > 1 or t1.c < 10) and t2.c > 1 -- 优化后SQL SELECT `t1`.`a`, `t1`.`b`, `t2`.`c` FROM (SELECT `a`, `b`, `c` FROM `t1` WHERE (`a` > 1) AND (`c` > 1 OR `c` < 10)) AS `t1` JOIN (SELECT `a`, `b`, `c` FROM `t2` WHERE (`a` > 1) AND (`c` > 1)) AS `t2` ON `t1`.`b` = `t2`.`b` WHERE (`t1`.`c` > 1 OR `t1`.`c` < 10)
表达式 t1.c > 1 和 t2.c < 10,不属于同一个表,不可以看成一个表达式,因此不能一起下推;
只有表达式 t2.c > 1 可以下推。
-- 原SQL SELECT t1.a, t1.b, t2.c FROM ( SELECT a, b, c FROM t1 where a > 1 ) t1 JOIN ( SELECT a, b, c FROM t2 where a > 1) t2 ON t1.b = t2.b WHERE (t1.c > 1 or t2.c < 10) and t2.c > 1 -- 优化后SQL SELECT `t1`.`a`, `t1`.`b`, `t2`.`c` FROM (SELECT `a`, `b`, `c` FROM `t1` WHERE `a` > 1) AS `t1` JOIN (SELECT `a`, `b`, `c` FROM `t2` WHERE (`a` > 1) AND (`c` > 1)) AS `t2` ON `t1`.`b` = `t2`.`b` WHERE (`t1`.`c` > 1 OR `t2`.`c` < 10)
两表Join - exp1 or exp2 and exp3
等同于exp1 or exp2_3,无论exp2与exp3是否属于一个表,都不能下推。
-- 原SQL SELECT t1.a, t1.b, t2.c FROM ( SELECT a, b, c FROM t1 where a > 1 ) t1 JOIN ( SELECT a, b, c FROM t2 where a > 1) t2 ON t1.b = t2.b WHERE t1.c > 1 or t2.c < 10 and t2.c > 1 -- 优化后SQL SELECT `t1`.`a`, `t1`.`b`, `t2`.`c` FROM (SELECT `a`, `b`, `c` FROM `t1` WHERE `a` > 1) AS `t1` JOIN (SELECT `a`, `b`, `c` FROM `t2` WHERE `a` > 1) AS `t2` ON `t1`.`b` = `t2`.`b` WHERE `t1`.`c` > 1 OR `t2`.`c` < 10
join 中的下推传递
由于 b 是表 t1 和 t2 join 的关联字段,因此表达式 b > 1 将分别下推到表 t1 和 t2 的子查询中。
-- 原SQL SELECT t1.a, t1.b, t2.c FROM ( SELECT a, b, c FROM t1 where a > 1 ) t1 JOIN ( SELECT a, b, c FROM t2 where a > 1 ) t2 ON t1.b = t2.b WHERE t1.b > 1 -- 优化后SQL SELECT `t1`.`a`, `t1`.`b`, `t2`.`c` FROM (SELECT `a`, `b`, `c` FROM `t1` WHERE (`a` > 1) AND (`b` > 1)) AS `t1` JOIN (SELECT `a`, `b`, `c` FROM `t2` WHERE (`a` > 1) AND (`b` > 1)) AS `t2` ON `t1`.`b` = `t2`.`b`
子查询中有group by
表达式 t2.b > 2 下推到表 t2 子查询中的having中。
-- 原SQL SELECT t1.a, t1.c, t2.b FROM ( SELECT a, c FROM t1 where a > 1 ) t1 JOIN ( SELECT a, count(1) b FROM t2 group by a ) t2 ON t1.a = t2.a WHERE t2.b > 2 -- 优化后SQL SELECT `t1`.`a`, `t1`.`c`, `t2`.`b` FROM (SELECT `a`, `c` FROM `t1` WHERE `a` > 1) AS `t1` JOIN (SELECT `a`, count(1) AS `b` FROM `t2` GROUP BY `a` HAVING `b` > 2) AS `t2` ON `t1`.`a` = `t2`.`a`
子查询中有limit
由于表 t2 的子查询中包含 limit,下推后语义会变,所以表达式 t2.c > 1 不能下推。
表达式 t1.c > 1 下推到表 t1 的子查询中。
-- 原SQL SELECT t1.a, t1.b, t2.c FROM ( SELECT a, b, c FROM t1 where a > 1 ) t1 JOIN ( SELECT a, b, c FROM t2 where a > 1 limit 10) t2 ON t1.b = t2.b WHERE t1.c > 1 and t2.c > 1 -- 优化后SQL SELECT `t1`.`a`, `t1`.`b`, `t2`.`c` FROM (SELECT `a`, `b`, `c` FROM `t1` WHERE (`a` > 1) AND (`c` > 1)) AS `t1` JOIN (SELECT `a`, `b`, `c` FROM `t2` WHERE `a` > 1 LIMIT 10) AS `t2` ON `t1`.`b` = `t2`.`b` WHERE `t1`.`c` > 1
子查询中有union
表达式 t1_.a > 1 and t1_.b > 1 同时下推到表 t1 和 t2 的子查询中。
-- 原SQL select * from ( select a, b, c from t1 union select d, e, f from t2 ) as t1_ where t1_.a > 1 and t1_.b > 1 -- 优化后SQL SELECT * FROM (SELECT `a`, `b`, `c` FROM `t1` WHERE `a` > 1 AND `b` > 1 UNION SELECT `d` AS `a`, `e` AS `b`, `f` AS `c` FROM `t2` WHERE `a` > 1 AND `b` > 1 ) AS `t1_` WHERE `t1_`.`a` > 1
多表Join - t1 join t2 join t3
表达式 t1.a > 1 和 t2.a > 1 和 t3.a > 1分别下推到表t1、t2、t3的子查询中。
-- 原SQL SELECT t1.a, t2.b, t3.d FROM ( SELECT a, b, d FROM t1 where b > 1) t1 JOIN ( SELECT a, b, d FROM t2 where b > 1) t2 ON t1.d = t2.d JOIN ( SELECT a, b, d FROM t3 where b > 1) t3 ON t1.d = t3.d WHERE t1.a > 1 and t2.a > 1 and t3.a > 1 -- 优化后SQL SELECT `t1`.`a`, `t2`.`b`, `t3`.`d` FROM (SELECT `a`, `b`, `d` FROM `t1` WHERE (`b` > 1) AND (`a` > 1)) AS `t1` JOIN (SELECT `a`, `b`, `d` FROM `t2` WHERE (`b` > 1) AND (`a` > 1)) AS `t2` ON `t1`.`d` = `t2`.`d` JOIN (SELECT `a`, `b`, `d` FROM `t3` WHERE (`b` > 1) AND (`a` > 1)) AS `t3` ON `t1`.`d` = `t3`.`d` WHERE `t1`.`a` > 1 AND `t2`.`a` > 1
嵌套Join - t1 join ( t2 join t3 )
-- 原SQL SELECT t1.a, t1.d, t2.a, t2.b, t2.c FROM ( SELECT a, b, d FROM t1 ) t1 JOIN ( SELECT a, b, c, t2.d FROM t2 JOIN ( SELECT a, b, d FROM t3 ) t3 ON t2.d = t3.d ) t2 ON t1.d = t2.d WHERE t1.a > 1 AND t2.c > 1 AND t2.b > 1 -- 优化后SQL SELECT `t1`.`a`, `t1`.`d`, `t2`.`a`, `t2`.`b`, `t2`.`c` FROM ( SELECT `a`, `b`, `d` FROM `t1` WHERE `a` > 1) AS `t1_` JOIN ( SELECT `a`, `b`, `c`, `t2`.`d` FROM `t2` JOIN ( SELECT `a`, `b`, `d` FROM `t3` WHERE `b` > 1) AS `t3` ON `t2`.`d` = `t3`.`d` WHERE (`c` > 1) AND (`b` > 1) ) AS `t2_` ON `t1_`.`d` = `t2_`.`d` WHERE `t1_`.`a` > 1 AND `t2_`.`c` > 1
列裁剪
join 中
表 t1 中的列 c 和表 t2 中的列 c 被裁剪掉了。
-- 原SQL SELECT t1.a FROM ( SELECT a, b, c FROM t1) t1 JOIN ( SELECT a, b, c FROM t2) t2 ON t1.b = t2.b -- 优化后SQL SELECT `t1`.`a` FROM (SELECT `a`, `b` FROM `t1`) AS `t1` JOIN (SELECT `a`, `b`
union 中
表 t1 中的列 c 和 表 t2 中的列 f 被裁剪掉了。
-- 原SQL select a, b from ( select a, b, c from t1 union select d, e, f from t2 ) t1_ -- 优化后的SQL SELECT `a`, `b` FROM (SELECT `a`, `b` FROM `t1` UNION SELECT `d` AS `a`, `e` AS `b` FROM `t2`
子查询合并
-- 原SQL select a, b from ( select a, b, c from t1 where a > 1 ) -- 优化后SQL SELECT `a` AS `a`, `b` AS `b` FROM `t1`