Create the Ride Sharing Tables

Before you get started, ensure that:

  • The Varada cluster is up and running

  • You can access the public Ride Sharing dataset S3 bucket at s3://varada-rideshare-demo

👍

You create the tables for the Ride Sharing dataset in an existing schema or create a new schema. If you are using an existing schema, begin with step 2.

  1. In the Query Editor, create a new trips_data_sample schema:
drop schema if exists trips_data_sample;
create schema trips_data_sample;
use trips_data_sample;
  1. In the Query Editor, create the trips_data table:

❗️

The prefix of the external_location attribute value in the commands to create the tables differs according to whether the cluster is connected to an AWS Glue metastore or to a Hive metastore. Use the appropriate command syntax for your system.

CREATE TABLE trips_data(
tripid int,
t_hour int,
t_min int,
d_weekday int,
t_start_ts timestamp,
fare double,
src_zone int,
dst_zone int,
duration double,
distance double,
rider_id int,
rider_age int,
rider_first varchar,
rider_last varchar,
rider_gender varchar,
driver_id int,
driver_age int,
driver_first varchar,
driver_last varchar,
driver_gender varchar,
ts timestamp,
lon double,
lat double,
point_id int,
last_point int,
d_date date)
WITH (                                                                             
  external_location= 's3://varada-rideshare-demo/trips_data',
  partitioned_by = ARRAY['d_date'],
  format = 'PARQUET')
CREATE TABLE trips_data(
tripid int,
t_hour int,
t_min int,
d_weekday int,
t_start_ts timestamp,
fare double,
src_zone int,
dst_zone int,
duration double,
distance double,
rider_id int,
rider_age int,
rider_first varchar,
rider_last varchar,
rider_gender varchar,
driver_id int,
driver_age int,
driver_first varchar,
driver_last varchar,
driver_gender varchar,
ts timestamp,
lon double,
lat double,
point_id int,
last_point int,
d_date date)
WITH (                                                                             
  external_location= 's3a://varada-rideshare-demo/trips_data',
  partitioned_by = ARRAY['d_date'],
  format = 'PARQUET')
  1. Create the riders_campaign table:
CREATE TABLE riders_campaign(
 rider_id int,
 num_trips int,
 segment varchar)
WITH (                                                                              
   external_location= 's3://varada-rideshare-demo/riders_orc',
   format = 'ORC')
CREATE TABLE riders_campaign(
 rider_id int,
 num_trips int,
 segment varchar)
WITH (                                                                              
   external_location= 's3a://varada-rideshare-demo/riders_orc',
   format = 'ORC')
  1. Discover the trips_data_sample table partitions:
call system.sync_partition_metadata ('trips_data_sample', 'trips_data', 'full');
  1. Ensure the trips_data table holds the expected number of rows (4486328000):
select count(*) from trips_data
  1. Ensure the riders_campaign table holds the expected number of rows (1501):
select count(*) from riders_campaign