Create the Ride Sharing Tables
Before you get started, ensure that:
-
The Varada cluster is up and running
-
You can access the public Ride Sharing dataset S3 bucket at s3://varada-rideshare-demo
You create the tables for the Ride Sharing dataset in an existing schema or create a new schema. If you are using an existing schema, begin with step 2.
- In the Query Editor, create a new ride_sharing_dataset schema:
drop schema if exists varada.ride_sharing_dataset;
create schema varada.ride_sharing_dataset;
use varada.ride_sharing_dataset;
- In the Query Editor, create the trips_data table:
The prefix of the external_location attribute value in the commands to create the tables differs according to whether the cluster is connected to an AWS Glue metastore or to a Hive metastore. Use the appropriate command syntax for your system.
CREATE TABLE trips_data(
tripid int,
t_hour int,
t_min int,
d_weekday int,
t_start_ts timestamp,
fare double,
src_zone int,
dst_zone int,
duration double,
distance double,
rider_id int,
rider_age int,
rider_first varchar,
rider_last varchar,
rider_gender varchar,
driver_id int,
driver_age int,
driver_first varchar,
driver_last varchar,
driver_gender varchar,
ts timestamp,
lon double,
lat double,
point_id int,
last_point int,
d_date date)
WITH (
external_location= 's3://varada-rideshare-demo/trips_data',
partitioned_by = ARRAY['d_date'],
format = 'PARQUET')
CREATE TABLE trips_data(
tripid int,
t_hour int,
t_min int,
d_weekday int,
t_start_ts timestamp,
fare double,
src_zone int,
dst_zone int,
duration double,
distance double,
rider_id int,
rider_age int,
rider_first varchar,
rider_last varchar,
rider_gender varchar,
driver_id int,
driver_age int,
driver_first varchar,
driver_last varchar,
driver_gender varchar,
ts timestamp,
lon double,
lat double,
point_id int,
last_point int,
d_date date)
WITH (
external_location= 's3a://varada-rideshare-demo/trips_data',
partitioned_by = ARRAY['d_date'],
format = 'PARQUET')
- Create the riders_campaign table:
CREATE TABLE riders_campaign(
rider_id int,
num_trips int,
segment varchar)
WITH (
external_location= 's3://varada-rideshare-demo/riders_orc',
format = 'ORC')
CREATE TABLE riders_campaign(
rider_id int,
num_trips int,
segment varchar)
WITH (
external_location= 's3a://varada-rideshare-demo/riders_orc',
format = 'ORC')
- Create the places table:
CREATE TABLE places(
lat double,
lon double)
WITH (
external_location= 's3://varada-rideshare-demo/places',
format = 'PARQUET')
CREATE TABLE places(
lat double,
lon double)
WITH (
external_location= 's3a://varada-rideshare-demo/places',
format = 'PARQUET')
- Discover the trips_data table partitions:
call system.sync_partition_metadata ('ride_sharing_dataset', 'trips_data', 'full');
- Ensure the trips_data table holds the expected number of rows (4486328000):
select count(*) from trips_data
- Ensure the riders_campaign table holds the expected number of rows (1501):
select count(*) from riders_campaign
Updated about 1 year ago