Coming to Impala from an RDBMS Background

dazheng 2014-11-12

展开全文

Coming to Impala from an RDBMS Background

When you come to Impala from a background with a traditional relational database product, you find the same familiar SQL query language and DDL statements. Data warehouse experts will already be familiar with the notion of partitioning. If you have only dealt with smaller OLTP-style databases, the emphasis on large data volumes will expand your horizons.

Standard SQL

The great thing about coming to Impala with relational database experience is that the query language is completely familiar: it’s just SQL! The SELECT syntax works like you are used to, with joins, views, relational operators, aggregate functions, ORDER BY and GROUP BY, casts, column aliases, built-in functions, and so on.

Because Impala is focused on analytic workloads, it currently doesn’t have OLTP-style operations such as DELETE, UPDATE, or COMMIT / ROLLBACK. It also does not have indexes, constraints, or foreign keys; data warehousing experts traditionally minimize their reliance on these relational features because they involve performance overhead that can be too much when dealing with large amounts of data.

The initial Impala release supports a set of core column data types: STRING instead of VARCHAR or VARCHAR2; INT and FLOAT instead of NUMBER; and no BLOB type.

The CREATE TABLE and INSERT statements incorporate some of the format clauses that you might expect to be part of a separate data-loading utility, because Impala is all about the shortest path to ingest and analyze data.

The EXPLAIN statement provides a logical overview of statement execution. Instead of showing how a query uses indexes, the Impala EXPLAIN output illustrates how parts of the query are distributed among the nodes in a cluster, and how intermediate results are combined at the end to produce the final result set.

Impala implements SQL-92 standard features with some enhancements from later SQL standards. It does not yet have does not yet have the SQL-99 and SQL-2003 analytic functions, although those items are on the product roadmap.

Storage, Storage, Storage

Several aspects of the Apache Hadoop workflow, with Impala in particular, are very freeing to a longtime database user:

The data volumes are so big that you start out with a large pool of storage to work with. This reality tends to reduce the bureaucracy and other headaches associated with a large and fast-growing database.
The flexibility of Impala schemas means there is less chance of going back and reorganizing old data based on recent changes to table structures.
The HDFS storage layer means that replication and backup are handled at the level of an entire cluster rather than for each individual database or table.

The key is to store the data in some form as quickly, conveniently, and scalably as possible through the flexible Hadoop software stack and file formats. You can come back later and define an Impala schema for existing data files. The data loading process for Impala is very lightweight; you can even leave the data files in their original locations and query them there.

Billions and Billions of Rows

Although Impala can work with data of any volume, its performance and scalability shine when the data is large enough to be impractical to produce, manipulate, and analyze on a single server. Therefore, after you do your initial experiments to learn how all the pieces fit together, you very quickly scale up to working with tables containing billions of rows and gigabytes, terabytes, or larger of total volume. The toy problems you tinker with might involve data sets bigger than you ever used before. You might have to rethink your benchmarking techniques if you are used to using smaller volumes—meaning millions of rows or a few tens of gigabytes. You will start relying on the results of analytic queries because the scale will be bigger than you can grasp through your intuition.

For problems that do not tax the capabilities of a single machine, many alternative techniques offer about the same performance. After all, if all you want to do is sort or search through a few files, you can do that plenty fast with Perl scripts or Unix commands such as grep. The Big Data issues come into play when the files are too large to fit on a single machine, or when you want to run hundreds of such operations concurrently, or when an operation that takes only a few seconds for megabytes of data takes hours when the data volume is scaled up to gigabytes or petabytes.

You can learn the basics of Impala SQL and confirm that all the prerequisite software is configured correctly using tiny data sets, as in the examples throughout this article. That’s what we call a “canary test,” to make sure all the pieces of the system are hooked up properly.

To start exploring scenarios involving performance testing, scalability, and multi-node cluster configurations, you typically use much, much larger data sets. Try generating a billion rows of representative data, then once the raw data is in Impala, experiment with different combinations of file formats, compression codecs, and partitioning schemes.

Don’t put too much faith in performance results involving only a few gigabytes of data. Only when you blow past the data volume that a single server could reasonably handle or saturate the I/O channels of your storage array can you fully appreciate the performance speedup of Impala over competing solutions and the effects of the various tuning techniques. To really be sure, do trials using volumes of data similar to your real-world system.

If today your data volume is not at this level, next year it might be. You should not wait until your storage is almost full (or even half full) to set up a big pool of HDFS storage on cheap commodity hardware. Whether or not your organization has already adopted the Apache Hadoop software stack, experimenting with Cloudera Impala is a valuable exercise to future-proof your enterprise.

How Impala Is Like a Data Warehouse

With Impala, you can unlearn some notions from the RDBMS world. Long-time data warehousing users might already be in the right mindset, because some of the traditional database best practices naturally fall by the wayside as data volumes grow and raw query speed becomes the main consideration. With Impala, you will do less planning for normalization, skip the time and effort that goes into designing and creating indexes, and stop worrying when queries cause full-table scans.

Impala, as with many other parts of the Hadoop software stack, is optimized for fast bulk read and data load operations. Many data warehouse-style queries involve either reading all the data (“what is the highest number of different visitors our website ever had in one day?”) or reading some large set of values organized by criteria such as time (“what were the total sales for the company in the fourth quarter of last year?”). Impala divides up the work of reading large data files across the nodes of a cluster. Impala also does away with the performance overhead of creating and maintaining indexes, instead taking advantage of the multimegabyte HDFS block size to read and process high volumes of data in parallel across multiple networked servers. As soon as you load the data, it is ready to be queried. Impala can run efficient ad hoc queries against any columns, not just preplanned queries using a small set of indexed columns.

In a traditional database, normalizing the data and setting up primary key / foreign key relationships can be time consuming for large data volumes. That is why data warehouses (and also Impala) are more tolerant of denormalized data, with values that are duplicated and possibly stored in raw string form rather than condensed to numeric IDs. The Impala query engine works very well for data warehouse-style input data by doing bulk reads and distributing the work among nodes in a cluster. Impala can even condense bulky, raw data into a data warehouse-friendly layout automatically as part of a conversion to the Parquet file format.

When executing a query involves sending requests to several servers in a cluster, the way to minimize total resource consumption (disk I/O, network traffic, and so on) is to make each server do as much local processing as possible before sending back the results. Impala queries typically work on data files in the multimegabyte or gigabyte range, where a server can read through large blocks of data very quickly. Impala does as much filtering and computation as possible on the server that reads the data to reduce overall network traffic and resource usage on the other nodes in the cluster. Thus, Impala can very efficiently perform “full table scans” of large tables, the kinds of queries that are common in analytical workloads.

Impala makes use of partitioning, another familiar notion from the data warehouse world. Partitioning is one of the major optimization techniques you will employ to reduce disk I/O and maximize the scalability of Impala queries. Partitioned tables physically divide the data based on one or more criteria, typically by date or geographic region, so that queries can filter out irrelevant data and skip the corresponding data files entirely. Although Impala can quite happily read and process huge volumes of data, your query will be that much faster and more scalable if a query for a single month only reads one-twelfth of the data for that year, or if a query for a single US state only reads one-fiftieth of the data for the entire country. Partitioning typically does not impose much overhead on the data loading phase; the partitioning scheme usually matches the way data files are already divided, such as when you load a group of new data files each day.

Your First Impala Queries

To get your feet wet with the basic elements of Impala query syntax such as the underlying data types and expressions, you can run queries without any table or WHERE clause at all:

SELECT 2+2;
SELECT SUBSTR('Hello world',1,5);
SELECT CAST(99.5 AS INT);
SELECT CONCAT('aaa',"bbb",'ccc');
SELECT 2 > 1;
SELECT NOW();

Because Impala does not have any built-in tables, running queries against real data requires a little more preparation. We’ll use the INSERT ... VALUES statement to create a couple of “toy” tables, although for scalability reasons we would quickly leave the VALUES clause behind when working with data of any significant volume.

-- Set up a table to look up names based on abbreviations.
CREATE TABLE canada_regions (name STRING, abbr STRING);
INSERT INTO canada_regions VALUES
  ("Newfoundland and Labrador" ,"NL"),
  ("Prince Edward Island","PE"),
  ("New Brunswick","NB"), ("Nova Scotia","NS"),
  ("Quebec","PQ"), ("Ontario","ON"),
  ("Manitoba","MB"), ("Saskatchewan","SK"), ("Alberta","AB"),
  ("British Columbia","BC"), ("YT","Yukon"),
  ("Northwest Territories","NT"), ("Nunavut","NU");

-- Set up a potentially large table
-- with data values we will use to answer questions.
CREATE TABLE canada_facts
  (id STRING, sq_mi INT, population INT);

-- The INSERT statement either appends to existing data in
-- a table via INSERT INTO, or replaces the data entirely
-- via INSERT OVERWRITE.

-- Here we start by inserting partial data...
INSERT INTO canada_facts VALUES ("NL",156453,514536),
  ("PE",2190,140204);
-- ... then we replace the entire contents of the table
-- with the complete data.
INSERT OVERWRITE canada_facts VALUES ("NL",156453,514536),
  ("PE",2190,140204), ("NB",28150,751171), ("NS",21345,921727),
  ("PQ",595391,8054756), ("ON",415598,13505900),
  ("MB",250950,1208268), ("SK",251700,1033381),
  ("AB",255541,3645257), ("BC",364764,4400057),
  ("YT",186272,33897), ("NT",519734,41462), ("NU",78715,31906);

-- A view is an alias for a longer query, and takes no time or
-- storage to set up.
-- Querying a view avoids repeating clauses over and over.
CREATE VIEW atlantic_provinces AS SELECT * FROM canada_facts
  WHERE id IN ('NL','PE','NB','NS');
CREATE VIEW maritime_provinces AS SELECT * FROM canada_facts
  WHERE id IN ('PE','NB','NS');
CREATE VIEW prairie_provinces AS SELECT * FROM canada_facts
  WHERE id IN ('MB','SK','AB');

-- We can query a single table, multiple tables via joins, or
build new queries on top of views.
SELECT name AS "Region Name" FROM canada_regions
  WHERE abbr LIKE 'N%';
+---------------------------+
| region name               |
+---------------------------+
| Newfoundland and Labrador |
| New Brunswick             |
| Nova Scotia               |
| Northwest Territories     |
| Nunavut                   |
+---------------------------+
-- Here we get the population figure from one table and the
-- full name from another.
SELECT canada_regions.name, canada_facts.population
  FROM canada_facts JOIN canada_regions
  ON (canada_regions.abbr = canada_facts.id);
+---------------------------+------------+
| name                      | population |
+---------------------------+------------+
| Newfoundland and Labrador | 514536     |
| Prince Edward Island      | 140204     |
| New Brunswick             | 751171     |
| Nova Scotia               | 921727     |
| Quebec                    | 8054756    |
| Ontario                   | 13505900   |
| Manitoba                  | 1208268    |
| Saskatchewan              | 1033381    |
| Alberta                   | 3645257    |
| British Columbia          | 4400057    |
| Northwest Territories     | 41462      |
| Nunavut                   | 31906      |
+---------------------------+------------+
-- Selecting from a view lets us compose a series of
-- filters and functions.
SELECT SUM(population) AS "Total Population"
  FROM atlantic_provinces;
+------------------+
| total population |
+------------------+
| 2327638          |
+------------------+
SELECT AVG(sq_mi) AS "Area (Square Miles)"
  FROM prairie_provinces;
+---------------------+
| area (square miles) |
+---------------------+
| 252730.3333333333   |
+---------------------+

Getting Data into an Impala Table

Because Impala is typically near the end of the extract-transform-load (ETL) pipeline, its focus is on working with existing data rather than creating data from scratch. Thus, you typically start with data files, then get them into Impala using one of these techniques:

Issue a LOAD DATA statement to move data files into the Impala data directory for a table.
Physically copy or move data files into the Impala data directory for a table. (Not needed as much now, since the LOAD DATA statement debuted in Impala 1.1.)
Issue a CREATE EXTERNAL TABLE statement with a LOCATION clause, to point the Impala table at data files stored in HDFS outside the Impala data directories.
Issue an INSERT ... SELECT statement to copy data from one table to another. You can convert the data to a different file format in the destination table, filter the data using WHERE clauses, and transform values using operators and built-in functions.
Use any of the Hive data loading techniques, especially for tables using the Avro, SequenceFile, or RCFile formats. Because Impala and Hive tables are interchangeable, once data is loaded through Hive, you can query it through Impala.

Pro Tip

If you are already using batch-oriented SQL-on-Hadoop technology through the Apache Hive component, you can reuse Hive tables and their data directly in Impala without any time-consuming loading or conversion step. This cross-compatibility applies to Hive tables that use Impala-compatible types for all columns.

For Beginners Only

Issue one or more INSERT ... VALUES statements to create new data from literals and function return values. We list this technique last because it really only applies to very small volumes of data, or to data managed by HBase. Each INSERT statement produces a new tiny data file, which is a very inefficient layout for Impala queries against HDFS data. On the other hand, if you are entirely new to Hadoop, this is a simple way to get started and experiment with SQL syntax and various table layouts, data types, and file formats, but you should expect to outgrow the INSERT ... VALUES syntax relatively quickly. You might graduate from tables with a few dozen rows straight to billions of rows when you start working with real data. Make sure to clean up any unneeded small files after finishing with INSERT ... VALUES experiments.