为什么Hive中的group by后面不能使用字段别名？

为什么Hive中的group by后面不能使用字段别名呢？看了这篇文章你将一目了然。

1. 案例说明

我们以一个例子来说明。

（1）数据

创建business.txt，内容如下：

jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

（2）创建表

create table business(
name string,
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

（3）导入数据

1 2	load data local inpath "data/business.txt" into table business;

（4）查找每月购物的人次

使用字段别名报错：

1
2
3

select substring(orderdate,1,length(orderdate)-3) month_value, count(name)
from business
group by month_value;

报错如下：

1	FAILED: SemanticException [Error 10004]: Line 3:9 Invalid table alias or column reference 'month_value': (possible column names are: name, orderdate, cost)

不使用字段别名可以正常运行：

1
2
3

select substring(orderdate,1,length(orderdate)-3) month_value, count(name)
from business
group by substring(orderdate,1,length(orderdate)-3);

查询结果为：

month_value	_c1
2017-01	6
2017-02	1
2017-04	5
2017-05	1
2017-06	1

2. 分析原因

为了什么使用别名会报错呢？因为Sql语句执行顺序为：

(7)    SELECT
(8)    DISTINCT <select_list>
(1)    FROM <left_table>
(3)    <join_type> JOIN <right_table>
(2)    ON <join_condition>
(4)    WHERE <where_condition>
(5)    GROUP BY <group_by_list>
(6)    HAVING <having_condition>
(9)    ORDER BY <order_by_condition>
(10)   LIMIT <limit_number>

从上述sql的执行顺序可以看出，Group by之所以不能使用字段别名，是因为当执行到第五步GROUP BY时，第七步的SELECT还没有执行。由于ORDER BY在SELECT后面，所以可以使用字段别名。

另外，FROM语句是第一步执行的，故所有的关键字都可以使用表别名，这个一定要注意。

3. 解决办法

我们可以通过子查询的方式解决group by语句不能使用别名的问题，例如查找每月购物的人次，可以通过下列sql语句：

select
tmp.month_value,count(tmp.name)
from (
select substring(orderdate,1,length(orderdate)-3) month_value,name from business
) tmp
group by tmp.month_value;

由于最先执行的是最外层from后面的语句，所以group by可以使用子查询里面的字段别名。

4. 执行效率

分析下列sql

### 第一个sql
select substring(orderdate,1,length(orderdate)-3) month_value, count(name)
from business
group by substring(orderdate,1,length(orderdate)-3);

我们可以看到，下列sql语句有两个substring(orderdate,1,length(orderdate)-3)，这是不是代表该语句中的方法执行了两次呢？其实并不是，Hive里面经过了优化，这样写并不会增加运行耗时。

这个结论我是参考网友的结论，可能不太准确，我从运行时间上做了一个简单的验证

### 第二个sql
select count(name)
from business
group by substring(orderdate,1,length(orderdate)-3);

上面两个sql语句：

第一个sql运行时间：1mins, 10sec
第二个sql运行时间：1mins, 6sec

可以看到，并没有增加多少运行时间。

5. 总结

所有的关键字后面都可以用表的别名，而字段的别名根据sql的执行顺序来判断。

但是，在mysql中：

group by中可以使用字段别名
where中不能使用别名
order by中可以使用别名

mysql特殊是因为mysql中对查询做了加强。其余像oracle、hive中别名的使用都是严格遵循sql执行顺序的。

【参考资料】

https://www.cnblogs.com/aspirant/p/12626203.html