SAS/ACCESS generates SAS table properties only when it creates
a new Hive table. Many, or perhaps all, of your Hive tables are created
by other means. For example, your Hadoop administrator might create
Hive table definitions by submitting DDL scripts to the Hive CLI.
SAS and SAS users can benefit from adding SAS table properties to these
existing Hive table definitions. Here is an example in which a Hive table
has already been defined.
CREATE EXTERNAL TABLE weblogs (extract_date STRING,
   extract_type INT, webdata STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
   STORED AS TEXTFILE LOCATION '/user/hadoop/web_data'
Based on this table
definition, here is how SAS interprets the columns.
libname hdp sasiohdp server=dbihadoop user=hadoop_usr pwd=hadoop_pwd;
data sheetmetal_sales; set hdp.weblogs(obs=1);
put extract_date= extract_type=;
put webdata=;
run;
extract_date=2012-02-21 extract_type=1
webdata=http://www.sas.com/industry/oilgas
NOTE: There were 1 observations read from the data set HDP.WEBLOGS.
proc contents data=hdp.weblogs; run;
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat Label
1 extract_date Char 32767 $32767. $32767. extract_date
2 extract_type Num 8 11. 11. extract_type
3 webdata Char 32767 $32767. $32767. webdata
Notice that Hive describes the extract_date column to SAS as a STRING
with a length of 32767, and it describes the webdata column the same way.
So SAS/ACCESS reads both columns as character data and applies the
$32767. format to them. The result is an overly wide SAS data set
with an observation (row) width of 64 kilobytes, and one that does
not format extract_date as a SAS DATE.
SAS issues a warning
message for this situation that includes the maximum column length
observed in the result set. In this example, the maximum length read
for the extract_date STRING column was 10 bytes, and the maximum length
read for the webdata STRING column was 320 bytes.
WARNING:
SAS/ACCESS assigned these columns a length of 32767. If resulting
SAS character variables remain this length, SAS performance is impacted.
See SAS/ACCESS documentation for details. Columns followed by the
maximum length observed were: extract_date:10, webdata:320
The example below assumes that the length of the webdata STRING in Hive
never exceeds 1000 characters. A Hadoop user ID with the appropriate
authority can issue Hive ALTER TABLE statements to add SAS table properties
to the Hive table definition.
ALTER TABLE weblogs SET TBLPROPERTIES ('SASFMT:extract_date'='DATE(9.0)')
ALTER TABLE weblogs SET TBLPROPERTIES ('SASFMT:webdata'='CHAR(1000)')
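To confirm that the properties were stored on the table, you can ask Hive to list them. This is a quick check from the Hive CLI, assuming a Hive release that supports the SHOW TBLPROPERTIES statement:

```
SHOW TBLPROPERTIES weblogs;
```

The output should include the SASFMT:extract_date and SASFMT:webdata entries alongside any other table properties.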
SAS/ACCESS honors the
added properties, and here is the result.
libname hdp sasiohdp server=dbihadoop user=hadoop_usr pwd=hadoop_pwd;
data sheetmetal_sales; set hdp.weblogs(obs=1);
put extract_date= extract_type=;
put webdata=;
run;
extract_date=21FEB2012 extract_type=1
webdata=http://www.sas.com/industry/oilgas
NOTE: There were 1 observations read from the data set HDP.WEBLOGS.
proc contents data=hdp.weblogs; run;
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat Label
1 extract_date Num 8 DATE9. DATE9. extract_date
2 extract_type Num 8 11. 11. extract_type
3 webdata Char 1000 $1000. $1000. webdata
The SAS data set that results from reading the Hive table now has a much
smaller observation width, which saves disk space and reduces CPU
consumption. SAS also automatically converts extract_date to a SAS date
and formats it with the standard DATE9. format.
Adding SAS table properties to existing Hive tables does not affect use
of those tables by software other than SAS. You can also issue ALTER TABLE
and other DDL statements with SAS/ACCESS explicit SQL
(see SQL Pass-Through Facility Specifics for Hadoop). Issuing DDL
statements such as ALTER TABLE can be restricted to the Hadoop administrator.
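As a sketch, here is one of the earlier ALTER TABLE statements issued through explicit pass-through in PROC SQL, assuming the same sasiohdp engine and connection options used in the LIBNAME statements above:

```
proc sql;
   connect to sasiohdp (server=dbihadoop user=hadoop_usr pwd=hadoop_pwd);
   /* Text inside EXECUTE is passed to Hive unchanged */
   execute (ALTER TABLE weblogs
            SET TBLPROPERTIES ('SASFMT:webdata'='CHAR(1000)'))
      by sasiohdp;
   disconnect from sasiohdp;
quit;
```

Because the statement text is passed to Hive as-is, this approach works for any DDL that the connected Hadoop user ID is authorized to run.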