Talk about join and where in hive

Talk about join and where in hive

Public number: Java Little Coffee Show , website: javaxks.com

Author: Yicun HUI, link: www.cnblogs.com/zsql/

If many people first come into contact with the execution order of mysql ( from ->on ->join ->where ->group by ->having ->select ->distinct ->order by ->limit ), they may be Where will produce some misunderstandings. There are also some blogs on the Internet that have written about these contents, but I still want to test it myself. This article mainly distinguishes between on and where from the conditions of inner join, left/right join and full join, as well as adding In fact, understanding the conditions of the upper partition is of great help to writing hql. Either it can be more concise or better, and then start to practice.

Version: This article uses CDH 6.3.2 version, hive 2.1.1+cdh6.3.2 for testing

1. Test table and data

1.1, build a table

create table `user`( - user table, partition table department_id int, age int, sex string, name string ) PARTITIONED BY (`date` string) row format delimited fields terminated by',' STORED AS TEXTFILE; create table department( - department table id int, name string, count int ) row format delimited fields terminated by',' STORED AS TEXTFILE; Copy code

1.2, data

-/data/hive/user1.txt 1,34,male,zhangsan 1,31,female,lili 3,14,female,liushen 3,24,female,sasa 4,54,male,liubei 4,36,female,yuji 4,25,male,zhaoyun 8,18,male,zhangfei -/data/hive/user2.txt 3,37,male,wangwu 4,38,female,lisi 3,19,female,caocao 2,22,female,guanyu 1,51,male,wzj 6,31,female,zhenji 6,25,male,sunwukong 6,17,male,tangsz -/data/hive/department.txt 1,dashuju,8 2,kaifa,9 3,ui,10 4,hr,3 5,shouxiao,12 6,officer, 3 Copy code

1.3, data import

load data local inpath'/data/hive/user1.txt' into table `user` partition (`date`='2020-12-24'); load data local inpath'/data/hive/user2.txt' into table `user` partition (`date`='2020-12-25'); load data local inpath'/data/hive/department.txt' into table `department`; Copy code

1.4, query data

SELECT * from

user
;

SELECT * from department;

1.5, analyze the table

ANALYZE TABLE `user` partition(`date`='2020-12-24') COMPUTE STATISTICS; ANALYZE TABLE `user` partition(`date`='2020-12-25') COMPUTE STATISTICS; ANALYZE TABLE department COMPUTE STATISTICS; Copy code

If you do not perform the above operations, problems will occur in the following practice, the data is not true, and the effect is not visible, so analyze the table before explaining, which is more accurate (just stepped on the pit, every time the explain comes out There is only 1 data, there is a problem with the statistics)

2. Analysis of on and where under inner join

2.1, do not use partitions for filtering

1. First look at the result of an unconditional inner join

SELECT * from `user` u inner join department d on d.id = u.department_id; duplicated code

View the execution plan:

explain SELECT * from `user` u inner join department d on d.id = u.department_id; duplicated code
STAGE DEPENDENCIES: Stage-4 is a root stage Stage-3 depends on stages: Stage-4 Stage-0 depends on stages: Stage-3 STAGE PLANS: Stage: Stage-4 Map Reduce Local Work Alias -> Map Local Tables: d Fetch Operator limit: -1 Alias -> Map Local Operator Tree: d TableScan alias: d filterExpr: id is not null (type: boolean) Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: id is not null (type: boolean) Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator keys: 0 department_id (type: int) 1 id (type: int) Stage: Stage-3 Map Reduce Map Operator Tree: TableScan alias: u filterExpr: department_id is not null (type: boolean) Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL Filter Operator predicate: department_id is not null (type: boolean) Statistics: Num rows: 16 Data size: 2944 Basic stats: COMPLETE Column stats: PARTIAL Map Join Operator condition map: Inner Join 0 to 1 keys: 0 department_id (type: int) 1 id (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10 Statistics: Num rows: 17 Data size: 3238 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string) , _col10 (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7 Statistics: Num rows: 17 Data size: 3238 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 17 Data size: 3238 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Local Work: Map Reduce Local Work Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Copy code

The above statement does not do any conditional filtering, and does not use partitions:

  • 1. perform a full table scan on the d table (department), and scan 6 pieces of data, and then add the filter condition of id **is not null by default to filter the 6 pieces of data scanned (automatically optimize, if you want to keep null under join If you need to do special processing in advance, or it will be optimized by default, which will lead to data loss), **After finally filtering, there are 6 pieces of data left to participate in the inner join.
  • Then a full table scan is performed on the u table (user), and 16 pieces of data are scanned, and the default filter condition department_id **is not null is also added . Finally, 16 pieces of data are obtained to participate in the inner join.

2. Next, let's look at the result of a where condition and an on condition

SELECT * from `user` u inner join department d on d.id=u.department_id and d.count> 9 and u.age> 20 where u.age <30; Copy code

Next look at the execution plan:

STAGE DEPENDENCIES: Stage-4 is a root stage Stage-3 depends on stages: Stage-4 Stage-0 depends on stages: Stage-3 STAGE PLANS: Stage: Stage-4 Map Reduce Local Work Alias -> Map Local Tables: d Fetch Operator limit: -1 Alias -> Map Local Operator Tree: d TableScan alias: d filterExpr: ((count> 9) and id is not null) (type: boolean) Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((count> 9) and id is not null) (type: boolean) Statistics: Num rows: 2 Data size: 19 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator keys: 0 department_id (type: int) 1 id (type: int) Stage: Stage-3 Map Reduce Map Operator Tree: TableScan alias: u filterExpr: ((age> 20) and department_id is not null and (age <30)) (type: boolean) Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL Filter Operator predicate: ((age> 20) and department_id is not null and (age <30)) (type: boolean) Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: PARTIAL Map Join Operator condition map: Inner Join 0 to 1 keys: 0 department_id (type: int) 1 id (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10 Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string) , _col10 (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7 Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Local Work: Map Reduce Local Work Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Copy code

The result is shown above:

  • First scan the d (department) table, scan the full table with 6 pieces of data, and filter them: ((count > 9) and id is not null), the filtered result leaves 2 pieces of data for inner join operation
  • Then scan the u (user) table, which is also the full table scan of 16 data, and filter it ((age **> 20) and department_id is not null and (age <30)), and filter the remaining 1 data (here There is an error. In fact, there should be 4 pieces of data remaining. The execution plan of hive is a rough statistical execution process, which is not completely correct.) **Perform inner join operation

**Small summary: **inner join does not use partition filtering, it will scan both of its tables, and then automatically add is not null to the join key (on d.id=u.department_id) Filter conditions, and then filter with the conditions behind on and where. In the inner join, where and on are filtered at the same time.

2.2, use partition filtering

1. First look at a situation with only partition filtering

SELECT * from `user` u inner join department d on d.id=u.department_id where u.`date`='2020-12-25'; Copy code

View the execution plan:

STAGE DEPENDENCIES: Stage-4 is a root stage Stage-3 depends on stages: Stage-4 Stage-0 depends on stages: Stage-3 STAGE PLANS: Stage: Stage-4 Map Reduce Local Work Alias -> Map Local Tables: d Fetch Operator limit: -1 Alias -> Map Local Operator Tree: d TableScan alias: d filterExpr: id is not null (type: boolean) Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: id is not null (type: boolean) Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator keys: 0 department_id (type: int) 1 id (type: int) Stage: Stage-3 Map Reduce Map Operator Tree: TableScan alias: u filterExpr: (department_id is not null and (date = '2020-12-25')) (type: boolean) Statistics: Num rows: 8 Data size: 134 Basic stats: COMPLETE Column stats: NONE #Here there are only 8 data in one partition Filter Operator predicate: department_id is not null (type: boolean) Statistics: Num rows: 8 Data size: 134 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 department_id (type: int) 1 id (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col8, _col9, _col10 Statistics: Num rows: 8 Data size: 147 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), '2020-12-25' (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7 Statistics: Num rows: 8 Data size: 147 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 8 Data size: 147 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Local Work: Map Reduce Local Work Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Copy code

The results are as above. Compared with the situation where no partition has been added, the table u (user) will not be scanned for the full table after the partition is added. This can improve efficiency, because the storage of the partition is a folder, so in the partition After filtering, you can specify the partition to scan, but the full table scan will not be performed. This situation shows: the partition table is partitioned and filtered, and then all the specified partitions are scanned, and then on and where and the automatically added is not Null condition is used for filtering, the filtered data is only used for inner join

2. Look at a result with conditions and partition filtering

SELECT * from `user` u inner join department d on d.id=u.department_id and d.count> 9 and u.age> 20 where u.age <30 and u.`date`='2020-12-24'; Copy code

Look at the execution plan:

STAGE DEPENDENCIES: Stage-4 is a root stage Stage-3 depends on stages: Stage-4 Stage-0 depends on stages: Stage-3 STAGE PLANS: Stage: Stage-4 Map Reduce Local Work Alias -> Map Local Tables: d Fetch Operator limit: -1 Alias -> Map Local Operator Tree: d TableScan alias: d filterExpr: ((count> 9) and id is not null) (type: boolean) Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((count> 9) and id is not null) (type: boolean) Statistics: Num rows: 2 Data size: 19 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator keys: 0 department_id (type: int) 1 id (type: int) Stage: Stage-3 Map Reduce Map Operator Tree: TableScan alias: u filterExpr: ((age> 20) and department_id is not null and (age <30)) (type: boolean) Statistics: Num rows: 8 Data size: 136 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((age> 20) and department_id is not null and (age <30)) (type: boolean) Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 department_id (type: int) 1 id (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col8, _col9, _col10 Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), '2020-12-24' (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7 Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Local Work: Map Reduce Local Work Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Copy code

The results are as above, and the results are compared with (not using partition condition filtering and using on and where filtering). It can be seen that the difference between using partition filtering is whether to scan all or scan the specified partition when scanning the table. Partition filtering, scan the entire table, otherwise, only scan the specified partition.

2.3. Summary of on and where under inner join

Under inner join, if partition filtering is used in the where condition, scan the data of the specified partition, and then filter through the where and on conditions, and add is not to the join key (on d.id=u.department_id) Null filter condition (here you need to pay attention to whether the data whose join key is null should be retained. If it needs to be retained, special processing on the join key is required, otherwise the data will be filtered out, resulting in data loss), here on and where It is filtered at the same time, without prioritizing.

3. Analysis of on and where under left/right join

Since left join and right join are of the same type, this article only focuses on left join in practice.

3.1, non-primary tables are executed under on and where conditions

First look at the result of an executed statement ( add filter conditions after on in the non-primary table ) (Convention: u (main table) left join d (non-main table) )

SELECT * from

user
u left join department d on d.id=u.department_id and d.count> 9

View Results

Then look at an execution plan:

STAGE DEPENDENCIES: Stage-4 is a root stage Stage-3 depends on stages: Stage-4 Stage-0 depends on stages: Stage-3 STAGE PLANS: Stage: Stage-4 Map Reduce Local Work Alias -> Map Local Tables: d Fetch Operator limit: -1 Alias -> Map Local Operator Tree: d TableScan alias: d filterExpr: (count> 9) (type: boolean) Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (count> 9) (type: boolean) Statistics: Num rows: 2 Data size: 19 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator keys: 0 department_id (type: int) 1 id (type: int) Stage: Stage-3 Map Reduce Map Operator Tree: TableScan alias: u Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL Map Join Operator condition map: Left Outer Join0 to 1 keys: 0 department_id (type: int) 1 id (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10 Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string) , _col10 (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7 Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Local Work: Map Reduce Local Work Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Copy code

The result is as above, we found that under the left join, the agreement: u (main table) left join d (non-main table) , the non-main table is on the condition d.count> 9 filtering is valid, and finally scan all 6 data, pass Conditionally filter the remaining 2 pieces of data and then perform a left join. The main table scans the entire table and performs a left join. Note here that the join keys (on d.id=u.department_id) of the two tables under the left join condition are not added with is Conditional filtering of not null, so you need to pay attention to whether the join key is empty when performing a left join, and it can be optimized if it is empty.

Look at the result of an executed statement ( add filter conditions after where in the non-primary table ) (convention: u (main table) left join d (non-main table) )

SELECT * from `user` u left join department d on d.id=u.department_id where d.count> 9 Copy code

The result is as follows: (the result of adding after the on is different from the non-main table)

Look at the execution plan:

STAGE DEPENDENCIES: Stage-4 is a root stage Stage-3 depends on stages: Stage-4 Stage-0 depends on stages: Stage-3 STAGE PLANS: Stage: Stage-4 Map Reduce Local Work Alias -> Map Local Tables: d Fetch Operator limit: -1 Alias -> Map Local Operator Tree: d TableScan alias: d Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator keys: 0 department_id (type: int) 1 id (type: int) Stage: Stage-3 Map Reduce Map Operator Tree: TableScan alias: u Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL Map Join Operator condition map: Left Outer Join0 to 1 keys: 0 department_id (type: int) 1 id (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10 Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (_col10> 9) (type: boolean) Statistics: Num rows: 5 Data size: 1039 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string) , _col10 (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7 Statistics: Num rows: 5 Data size: 1039 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 5 Data size: 1039 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Local Work: Map Reduce Local Work Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Copy code

Judging from the above execution plan, the tables u (user) and d (department) are left join to scan the full table without any filtering. After the result is obtained by the left join, the result is then used Non-primary table where condition d.count> 9 for filtering

**Small summary: (left join)** The results of filtering using on or where in a non-primary table are different. If you add conditional filtering under on, first filter the data of the table, and then Perform a left join. If you add a conditional filter after where, first perform a left join, and then filter the results of the left join with the where condition. In the left join, the default is not null is not added to the join key. Filter conditions.

3.2, the main table is executed under on and where conditions

First look at the result of an executed statement ( add filter conditions after on in the main table ) (Convention: u (main table) left join d (non-main table) )

SELECT * from `user` u left join department d on d.id=u.department_id and u.age> 20 Copy code

Seeing the above results, it is found that 16 pieces of data (all data in the user table) are still retained, but it is found that the data with age<=20 does not seem to participate in the left join, and the following values are all null .

Look at the execution plan:

STAGE DEPENDENCIES: Stage-4 is a root stage Stage-3 depends on stages: Stage-4 Stage-0 depends on stages: Stage-3 STAGE PLANS: Stage: Stage-4 Map Reduce Local Work Alias -> Map Local Tables: d Fetch Operator limit: -1 Alias -> Map Local Operator Tree: d TableScan alias: d Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator filter predicates: 0 {(age> 20)} 1 keys: 0 department_id (type: int) 1 id (type: int) Stage: Stage-3 Map Reduce Map Operator Tree: TableScan alias: u Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL Map Join Operator condition map: Left Outer Join0 to 1 filter predicates: 0 {(age> 20)} 1 keys: 0 department_id (type: int) 1 id (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10 Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string) , _col10 (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7 Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Local Work: Map Reduce Local Work Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Copy code

The results are as above. When processing the d (department table), scan the full table of 6 data, mark the condition of age>20 on the table d (department), and then perform a full table scan on the u (user) table and perform the full table Left join, in the process of left join, the d (department) table (fields derived from table d) is filtered by the condition age> 20 of the main table. If the age of table u is <=20, it is derived from table d The fields are all set to null , (as above is personal understanding, not necessarily correct. Simply put, make a judgment mark first, then perform left join, and filter by conditions in the process of left join (data that does not meet the conditions is retained in the main table) The data of the non-primary table is discarded and set to null)), here the condition after on will not affect the number of entries in the main table, and the left join is also performed first and filtered accordingly. It's more complicated to understand, so you can check the results yourself.

Look at the result of an executed statement ( add filter conditions after where in the main table ) (convention: u (main table) left join d (non-main table) )

SELECT * from `user` u left join department d on d.id=u.department_id where u.age> 20 Copy code

Next look at the execution plan:

STAGE DEPENDENCIES: Stage-4 is a root stage Stage-3 depends on stages: Stage-4 Stage-0 depends on stages: Stage-3 STAGE PLANS: Stage: Stage-4 Map Reduce Local Work Alias -> Map Local Tables: d Fetch Operator limit: -1 Alias -> Map Local Operator Tree: d TableScan alias: d Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator keys: 0 department_id (type: int) 1 id (type: int) Stage: Stage-3 Map Reduce Map Operator Tree: TableScan alias: u filterExpr: (age> 20) (type: boolean) Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL Filter Operator predicate: (age> 20) (type: boolean) Statistics: Num rows: 5 Data size: 920 Basic stats: COMPLETE Column stats: PARTIAL Map Join Operator condition map: Left Outer Join0 to 1 keys: 0 department_id (type: int) 1 id (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10 Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string) , _col10 (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7 Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Local Work: Map Reduce Local Work Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Copy code

The result is as above. It can be clearly seen that when the where filter is used in the main table, the data of the main table will be filtered first and then the left join will be performed. The main table will scan out 16 data, and the remaining 5 will be left after filtering. join to get the final result .

Small summary : (left join) The results of filtering using on or where under the main table are different. When using where to filter the main table, filter first and then perform left join. When using on to filter the main table, first perform the filter mark on the non-primary table, and then set the data from the non-primary table in the unqualified rows to null according to the filter conditions when performing a left join on the whole table.

3.3, left/right join uses partition filtering

Look at the following statement:

SELECT * from `user` u left join department d on d.id=u.department_id where u.age> 20 and u.`date` = '2020-12-24'; Copy code

result:

Look at the execution plan:

STAGE DEPENDENCIES: Stage-4 is a root stage Stage-3 depends on stages: Stage-4 Stage-0 depends on stages: Stage-3 STAGE PLANS: Stage: Stage-4 Map Reduce Local Work Alias -> Map Local Tables: d Fetch Operator limit: -1 Alias -> Map Local Operator Tree: d TableScan alias: d Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator keys: 0 department_id (type: int) 1 id (type: int) Stage: Stage-3 Map Reduce Map Operator Tree: TableScan alias: u filterExpr: ((age> 20) and (date = '2020-12-24')) (type: boolean) Statistics: Num rows: 8 Data size: 136 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (age> 20) (type: boolean) Statistics: Num rows: 2 Data size: 34 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Left Outer Join0 to 1 keys: 0 department_id (type: int) 1 id (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col8, _col9, _col10 Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), '2020-12-24' (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7 Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Local Work: Map Reduce Local Work Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Copy code

According to the above execution plan, it can be seen that the use of partition filtering can prevent the full table scan. As the above value scans a partition, there are only 8 data .

3.4. Summary of on and where under left/right join

  1. Use partition condition filtering to prevent full table scans, the highest priority filtering
  2. To filter under the main table, the results of filtering using on and where are different. When using where to filter the main table, filter first and then perform left join. When using on to filter the main table, first perform the filter mark on the non-primary table, and then set the data from the non-primary table in the unqualified rows to null according to the filter conditions when performing a left join on the whole table.
  3. For filtering under non-primary tables, the results of filtering using on and where are different. If you add conditional filtering under on, you will first filter the data of the table, and then perform left join, if you add conditions after where For filtering, the left join is performed first, and then the results obtained by the left join are filtered by the where condition
  4. The left/right join will not automatically add the filter condition of is not null to the join key, so pay attention to the case that the join key is null when the join key is left/right. This can be optimized

4. on and where analysis under full join

4.1, full join without filter conditions

Look directly at a full join without any conditions

SELECT * from `user` u full join department d on d.id=u.department_id Copy code

View the execution plan:

STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: u Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL Reduce Output Operator key expressions: department_id (type: int) sort order: + Map-reduce partition columns: department_id (type: int) Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL value expressions: age (type: int), sex (type: string), name (type: string), date (type: string) TableScan alias: d Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: id (type: int) sort order: + Map-reduce partition columns: id (type: int) Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE value expressions: name (type: string), count (type: int) Reduce Operator Tree: Join Operator condition map: Outer Join 0 to 1 keys: 0 department_id (type: int) 1 id (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10 Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string) , _col10 (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7 Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Copy code

The execution plan is as above, each table will be sorted in ascending order, without automatic optimization (adding null filtering), and a full join of the entire table will be executed.

4.2, full join with where conditions

SELECT * from `user` u full join department d on d.id=u.department_id where u.age> 20 and d.count> 9 Copy code

The result is as follows

View the execution plan:

STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: u Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL Reduce Output Operator key expressions: department_id (type: int) sort order: + Map-reduce partition columns: department_id (type: int) Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL value expressions: age (type: int), sex (type: string), name (type: string), date (type: string) TableScan alias: d Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: id (type: int) sort order: + Map-reduce partition columns: id (type: int) Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE value expressions: name (type: string), count (type: int) Reduce Operator Tree: Join Operator condition map: Outer Join 0 to 1 keys: 0 department_id (type: int) 1 id (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10 Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((_col10> 9) and (_col1> 20)) (type: boolean) Statistics: Num rows: 1 Data size: 207 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string) , _col10 (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7 Statistics: Num rows: 1 Data size: 207 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 207 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Copy code

It can be seen from the execution plan that when using where to filter under full join, the full table scan is performed first, and then the full join is performed. After the full join obtains the result, the conditions in where are filtered.

4.3. Full join with on condition (leave doubts)

SELECT * from `user` u full join department d on d.id=u.department_id and u.age> 20 and d.count> 9 Copy code

Seeing the above results, it may be a bit unexpected (personal ability is limited, powerful bloggers can explain and explain), personal understanding is like the on condition under the main table of the left join, which is filtered during the full join However, all the data in the two tables are retained, and only when both conditions are true, there is no null value. (If you don't understand in full join, try to use where conditional judgment)

View the execution plan:

STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: u Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL Reduce Output Operator key expressions: department_id (type: int) sort order: + Map-reduce partition columns: department_id (type: int) Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL value expressions: age (type: int), sex (type: string), name (type: string), date (type: string) TableScan alias: d Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: id (type: int) sort order: + Map-reduce partition columns: id (type: int) Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE value expressions: name (type: string), count (type: int) Reduce Operator Tree: Join Operator condition map: Outer Join 0 to 1 filter predicates: 0 {(VALUE._col0> 20)} 1 {(VALUE._col1> 9)} keys: 0 department_id (type: int) 1 id (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10 Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string) , _col10 (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7 Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Copy code

Have questions? ? ? ? , Good bloggers can leave a message below if they understand

4.4, full join with partition filtering

SELECT * from department d full join `user` u on d.id=u.department_id where u.`date`= '2020-12-24'; Copy code

View the execution plan:

STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: d Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: id (type: int) sort order: + Map-reduce partition columns: id (type: int) Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE value expressions: name (type: string), count (type: int) TableScan alias: u Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL Reduce Output Operator key expressions: department_id (type: int) sort order: + Map-reduce partition columns: department_id (type: int) Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL value expressions: age (type: int), sex (type: string), name (type: string), date (type: string) Reduce Operator Tree: Join Operator condition map: Outer Join 0 to 1 keys: 0 id (type: int) 1 department_id (type: int) outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8, _col9, _col10 Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (_col10 = '2020-12-24') (type: boolean) Statistics: Num rows: 8 Data size: 1663 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: int), _col1 (type: string), _col2 (type: int), _col6 (type: int), _col7 (type: int), _col8 (type: string), _col9 (type: string) , '2020-12-24' (type: string) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7 Statistics: Num rows: 8 Data size: 1663 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 8 Data size: 1663 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Copy code

According to the execution plan, it is known that in a full join, even if partition filtering is used, the result is obtained by the full join first, and then filtered by the where condition, so it is recommended to use the subquery to filter first, and then perform the full join.

4.5. Summary of on and where in full join

  1. There are still doubts under the on condition here. .
  2. Regardless of whether partition filtering is used under the condition of where, the full join is performed first, and then the filtering is performed, so here is the subquery filtering, and then full join
  3. The condition that the join key is not null is not automatically added in full jion

5. summary

1. inner join

  • Inner join can filter by partition first to prevent full table scan.
  • Inner join will automatically add a filter condition of is not null to the join key (on d.id=u.department_id)
  • In inner join, the conditions behind on and where are filtered. In inner join, where and on are filtered at the same time, there is no difference in order

2. Left/right join

  • Left/right join uses partition condition filtering to prevent full table scans, filtering is the highest priority
  • The left/right join filters under the main table. The results of filtering using on and where are different. When using where to filter the main table, filter first and then perform left join. When using on to filter the main table, first perform the filter mark on the non-main table, and then set the data from the non-main table in the non-compliant rows to null according to the filter conditions when performing left join on the whole table.
  • The left/right join is filtered under the non-primary table. The results of filtering using on and where are different. If you add conditional filtering under on, the data of the table is filtered first, and then the left join is performed. If it is in Adding conditional filtering after where is to perform the left join first, and then filter the results of the left join with the where condition, so when filtering non-primary tables, you can use on to perform conditional filtering, so as to prevent writing sub-queries
  • The left/right join will not automatically add the filter condition of is not null to the join key, so pay attention to the case that the join key is null when the join key is left/right. This can be optimized

3. Full join

  • Filter under on conditions in full join (in doubt, to be explored)
  • In full join, where conditions are filtered, whether it is partition filtering or not, the full join is performed first, and the filtering is performed according to the conditions. It is recommended that the subquery be filtered before the full join
  • Will not automatically add is not null filter conditions to the join key