The most detailed series of Hive articles in the entire network, it is strongly recommended to collect and pay attention!
Later updated articles will list historical article directories to help you review the key points of knowledge.
table of Contents
Series of historical articles
A high-quality blog in the field of big data in 2021 will take you from entry to proficiency. The blog is updated every day to gradually improve articles on various knowledge systems of big data to help everyone learn more efficiently.
Those interested in big data can follow the WeChat public account: 3.Gangs of Big Data
hive query syntax
1. the SELECT statement
1. Statement structure
SELECT [ALL | DISTINCT]select_expr, select_expr, ... FROM table_reference [WHERE where_condition] [GROUP BY col_list] [HAVING where_condition] [ORDER BY col_list] [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list] ] [LIMIT number] Copy code
1. ORDER BY is used for global sorting, that is, to perform global sorting on all the specified sort keys, use ORDER BY query statements, and finally use a Reduce Task to complete the global sorting. Explanation:
2. Sort by is used for sorting within the partition, that is, sorting within each Reduce task** . **, sort by only guarantees that the output of each reducer is in order, and does not guarantee global order.
3. ** distribute by ( ** field) divide the data into different reducers according to the specified field, and the distribution algorithm is hash.
4. ** cluster by ( ** field) In addition to the function of Distribute by, it also has the sorting function of sort by. .
Therefore, if the distribute by and sort by fields are the same, at this time, cluster by = distribute by + sort by
2. Full table query
select * from score; copy code
3. Select a specific column to query
select sid ,cid from score; copy code
4. Column alias
select sid as myid ,cid from score; copy code
5, commonly used functions
1) Find the total number of rows (count) select count(1) from score; 2) Find the maximum value of the score (max) select max(sscore) from score; 3) Find the minimum value of the score (min) select min(sscore) from score; 4) Find the sum of scores (sum) select sum(sscore) from score; 5) Find the average of scores (avg) select avg(sscore) from score; Copy code
6, LIMIT statement
A typical query will return multiple rows of data. The LIMIT clause is used to limit the number of rows returned.
select * from score limit 3; copy the code
7, WHERE statement
1) Use the WHERE clause to filter out rows that do not meet the conditions.
2) The WHERE clause immediately follows the FROM clause.
3) Case practice
Query data with a score greater than 60
select * from score where sscore> 60; copy the code
2. the operator
1, comparison operators operators
|Operator||Supported data types||description|
|A=B||Basic data type||If A is equal to B, it returns TRUE, otherwise it returns FALSE|
|A<=>B||Basic data type||If both A and B are NULL, it returns TRUE, the other is the same as the result of the equal sign (=) operator, if either is NULL, the result is NULL|
|A<>B, A!=B||Basic data type||If A or B is NULL, it returns NULL; if A is not equal to B, it returns TRUE, otherwise it returns FALSE|
|A****||Basic data type||If A or B is NULL, it returns NULL; if A is less than B, it returns TRUE, otherwise it returns FALSE|
|A<=B||Basic data type||If A or B is NULL, it returns NULL; if A is less than or equal to B, it returns TRUE, otherwise it returns FALSE|
|A>B||Basic data type||If A or B is NULL, it returns NULL; if A is greater than B, it returns TRUE, otherwise it returns FALSE|
|A>=B||Basic data type||If A or B is NULL, it returns NULL; if A is greater than or equal to B, it returns TRUE, otherwise it returns FALSE|
|A [NOT] BETWEEN B AND C||Basic data type||If any of A, B or C is NULL, the result is NULL. If the value of A is greater than or equal to B and less than or equal to C, the result is TRUE, otherwise it is FALSE. If the NOT keyword is used, the opposite effect can be achieved.|
|A IS NULL||All data types||If A is NULL, it returns TRUE, otherwise it returns FALSE|
|A IS NOT NULL||All data types||If A is not equal to NULL, it returns TRUE, otherwise it returns FALSE|
|IN (number 1, value 2)||All data types||Use the IN operation to display the values in the list|
|A [NOT] LIKE B||STRING type||B is a simple regular expression in SQL. If A matches it, it returns TRUE; otherwise, it returns FALSE. The expression of B is explained as follows:'x%' means A must start with the letter'x','%x' means A must end with the letter'x', and'%x%' means A contains the letter'x', It can be at the beginning, end or in the middle of the string. If the NOT keyword is used, the opposite effect can be achieved.|
|A RLIKE B, A REGEXP B||STRING type||B is a regular expression. If A matches it, it returns TRUE; otherwise, it returns FALSE. The matching is implemented using the regular expression interface in the JDK, because the regular expression is also based on the rules. For example, the regular expression must match the entire string A, not just its string.|
2. Case practice
(1) Query all data with a score equal to 80 select * from score where sscore = 80; (2) Query all data with a score of 80 to 100 select * from score where sscore between 80 and 100; (3) Query all data with empty results select * from score where sscore is null; (4) Query data whose score is 80 or 90 select * from score where sscore in(80,90); Copy code
3. LIKE and RLIKE
1) Use the LIKE operation to select similar values
2) The selection criteria can include characters or numbers:
% Represents zero or more characters (any number of characters).
_ Represents a character.
3) RLIKE clause is an extension of this function in Hive, which can specify matching conditions through Java's regular expression, a more powerful language.
4) Case practice
(1) Find all scores starting with 8 select * from score where sscore like '8%'; (2) Find all score data with the second value of 9 select * from score where sscore like'_9%'; (3) Find all score information with 1 in the id select * from score where sid rlike''; Copy code
2, logical operators operators
(1) Query data whose score is greater than 80 and sid is 01 select * from score where sscore >80 and sid = '01'; (2) The query result is greater than 80, or the sid is a number of 01 select * from score where sscore> 80 or sid = '01'; (3) Query students whose sid is not 01 and 02 select * from score where sid not in ('01','02'); Copy code
1. GROUP BY statement
The GROUP BY statement is usually used with aggregate functions to group by one or more queued results, and then perform aggregation operations on each group. Note that after using group by grouping, the fields after select can only be grouped fields and aggregate functions.
1) Calculate the average score of each student
select sid ,avg(sscore) from score group by sid; copy code
2) Calculate the highest score of each student
select sid ,max(sscore) from score group by sid; copy code
2, HAVING statement
1. The difference between having and where
(1) Where plays a role for the columns in the table to query data; having plays a role for the columns in the query results to filter data.
(2) The grouping function cannot be written after where, and the grouping function can be used after having.
(3) Having is only used for group by grouping statistics statement.
** 2. Case practice: **
- Find the average score of each student select sid ,avg(sscore) from score group by sid; - Find the people whose average score is greater than 85 for each student select sid ,avg(sscore) avgscore from score group by sid having avgscore> 85; Copy code
4. JOIN statement
Hive's join operation only supports equal joins
1, inner join (INNER JOIN)
Inner join: Only the data that matches the join conditions in the two tables that are connected will be retained.
select * from teacher t, course c where t.tid = c.tid; #Implicit inner connection select * from teacher t inner join course c on t.tid = c.tid; #Explicit inner join select * from teacher t join course c on t.tid = c.tid; Copy code
2, LEFT OUTER JOIN
Left outer join: All records in the table on the left side of the JOIN operator that meet the WHERE clause will be returned.
Query the courses corresponding to the teacher
select * from teacher t left join course c on t.tid = c.tid; Copy code
3, RIGHT OUTER JOIN
Right outer join: All records in the table to the right of the JOIN operator that meet the WHERE clause will be returned.
select * from teacher t right join course c on t.tid = c.tid; copy code
4, FULL OUTER JOIN
Full outer join: All records in all tables that meet the conditions of the WHERE statement will be returned. If the specified field of any table does not have a qualified value, then the NULL value is used instead.
SELECT * FROM teacher t FULL JOIN course c ON t.tid = c.tid ; Copy code
5, multi-table connection
Note: To connect n tables, at least n-1 connection conditions are required. For example: to connect three tables, at least two connection conditions are required.
Multi-table join query, query the teacher's corresponding course, and the corresponding score, the corresponding student
select * from teacher t left join course c on t.tid = c.tid left join score s on s.cid = c.cid left join student stu on s.sid = stu.sid; Copy code
In most cases, Hive will start a MapReduce task for each pair of JOIN connection objects. In this example, a MapReduce job will be started to connect the table teacher and the table course, and then a MapReduce job will be started to connect the output of the first MapReduce job with the table score; for the connection operation.
1. Order By-Global Sorting
Order By: global ordering, one reduce
1. Sort using the ORDER BY clause
ASC (ascend): Ascending order (default)
DESC (descend): descending order
2. The ORDER BY clause is at the end of the SELECT statement.
3. Case practice
(1) Query the students' scores and sort them in descending order of scores
SELECT * FROM student s LEFT JOIN score sco ON s.sid = sco.sid ORDER BY sco.sscore DESC; duplicated code
(2) Sort according to the average of the scores
select sid ,avg(sscore) avg from score group by sid order by avg; Copy code
(3) Sort by student id and average grade
select sid ,avg(sscore) avg from score group by sid order by sid,avg; copy code
2, the Sort MapReduce inside each partial sorting By-
Sort By: Sorting is performed internally in each MapReduce, not sorting for the global result set.
1) Set the number of reduce set mapreduce.job.reduces=3; 2) View the number of reduce settings set mapreduce.job.reduces; 3) Query results are sorted in descending order of results select * from score sort by sscore; 4) Import the query results into the file (arranged in descending order of grades) insert overwrite local directory'/export/data/exporthive/sort' select * from score sort by sscore; Copy code
3, Distribute By-partition sort
Distribute By: Similar to partition in MR, it is partitioned and used in combination with sort by.
Note that Hive requires the DISTRIBUTE BY statement to be written before the SORT BY statement.
For the test of distribute by, it is necessary to allocate multiple reduce for processing, otherwise the effect of distribute by cannot be seen.
First is divided according to student id area , and then sorted according to student achievement.
1) Set the number of reduce, divide our corresponding sid into the corresponding reduce set mapreduce.job.reduces=7; 2) Partition of data through distribute by insert overwrite local directory'/export/data/exporthive/distribute' select * from score distribute by sid sort by sscore; Copy code
4, Cluster By
When the distribute by and sort by fields are the same, the cluster by method can be used.
In addition to the function of distribute by, cluster by also has the function of sort by. But the sorting can only be in ascending order, and the sorting rule cannot be specified as ASC or DESC.
The following two ways of writing are equivalent:
select * from score cluster by sid; select * from score distribute by sid sort by sid; Copy code
- Blog homepage: lansonli.blog.csdn.net
- Welcome to like it Collection Leave a message Please correct me if there is any mistake!
- This article was originally written by Lansonli and first published on the CSDN blog
- The series of big data articles will be updated every day. When you stop and rest, don t forget that others are still running. I hope you all hurry up to learn and go all out for a better life.