Query the Ride Sharing Dataset

To get you started, we've prepared queries for you to run against the Ride Sharing dataset on the Varada catalog. These queries include various levels of selectivity, including a full scan query with no filters, a cohort analysis query, a selective query, and a query with a selective join.

Feel free to create queries of your own, or to edit these queries by changing the projected columns, the filtered columns, and the filter values.

Query 1: Full Scan Query

This query looks at all the trips between 23-25 March 2018 and counts the number of unique trips per day.

SELECT extract(day from "d_date") as day,
       extract(month from "d_date") as month,
       count(distinct "tripid") as cnt_trips
FROM
  trips_data_sample.trips_data
WHERE
  "d_date" between date '2018-03-23' and date '2018-03-25'
GROUP BY  1,2

Query 2: Cohort Analysis Query

This query looks at the cohort of 18-19 years old riders who took a ride on Thursdays. The query reports the trip distribution by rider gender.

SELECT "rider_gender",
       count(*) as cnt
FROM trips_data_sample.trips_data
WHERE "rider_age" IN (18,19)
      AND "d_weekday" = 5
GROUP BY  1
order by 2 desc;

Query 3: Selective Project Query

This query looks at all the trips for rider_id 3380311 that took place between 7:00am-10:00am in a specific geographic fence, and reports each trip’s exact route, duration, fare, and more.

SELECT
  "tripid", "rider_id", 
  "d_date", "t_hour",  
  "fare", "duration", "distance",
  "rider_age", "rider_gender",
  "lon", "lat"
FROM
  trips_data_sample.trips_data
WHERE
"rider_id" = 3380311
And lon < -122.4121814
And lon > -122.4284
And lat > 37.7660289 
And lat < 37.7760289 
AND "t_hour" between 7 and 10

Query 4: Selective Join Query

This query joins the trips_data table with the riders_campaign dimension table by rider_id, and looks for riders who had fewer than nine trips.

The query shows the following stats by month:

  • Number of unique trips
  • Number of unique riders
  • Average trip fare
  • Average trip duration
SELECT
  extract(month from fact."ts") as month,
  count(distinct fact."tripid") as cnt_unq_trips ,
  count(distinct fact."rider_id") as cnt_unq_riders,
  round(avg(fact."fare"),2) as avg_fare,
  round(avg(fact."duration")/60,2) as avg_duration_min
FROM
  trips_data_sample.trips_data as fact,
  trips_data_sample.riders_campaign as dim
WHERE
  dim."rider_id" = fact."rider_id" AND
  dim."segment" IN ('Students_Promotion','Churning_Riders_Last3Months') AND
  dim."num_trips" < 9
group by 1
order by 1;

Additional Queries

Query 5

This query gets the latitude, longitude, timestamp, and ID of each of the first 100 points of trip number 271004853.

SELECT
  lat,
  lon,
  ts,
  point_id
FROM
 trips_data_sample.trips_data trips_data
WHERE
  (trips_data.tripid = 271004853)
ORDER BY
  point_id
LIMIT
  100

Query 6

This query counts the unique trips in a specified geographic fence.

SELECT
    COUNT(DISTINCT trips_data.tripid) AS ctd_tripid_ok,
    trips_data.src_zone AS src_zone
FROM
    trips_data_sample.trips_data trips_data
WHERE
    (trips_data.distance >= 0.2341123202347175)
    AND (trips_data.t_hour BETWEEN 6 AND (6 + 3))
    AND (trips_data.lon >= -122.4284)
    AND (trips_data.lon <= -122.4121814 + 0.0007764634342138149)
    AND (trips_data.lat >= 37.7660289)
    AND (trips_data.lat <= 37.7760289 + 0.0002670984564692044)
GROUP BY
    2

Query 7

This query counts the trips with fares higher than 26.8136223835386, and reports the number of unique riders, and the average fare.

SELECT
    COUNT(*) AS num_trips,
    COUNT(DISTINCT rider_id) AS num_riders,
    AVG(fare) AS avg_fare
FROM
    trips_data_sample.trips_data trips_data
WHERE
    fare >= 26.8136223835386
    AND last_point = 1
LIMIT 1

Query 8

This query counts the number of trips made per hour by riders aged between18-22 years old during the weekends.

SELECT
    trips_data.t_hour,
    COUNT(*)
FROM
    trips_data_sample.trips_data trips_data
WHERE
    ( (trips_data.t_hour >= CAST (
    CASE WHEN TRY_CAST((SPLIT_PART('20,24',
    ',',
    1)) AS DOUBLE) >= 0 THEN FLOOR(TRY_CAST((SPLIT_PART('20,24', ',', 1)) AS DOUBLE))
    ELSE CEIL(TRY_CAST((SPLIT_PART('20,24',
    ',',
    1)) AS DOUBLE))
END AS BIGINT))
    AND (trips_data.t_hour <= CAST (
    CASE WHEN TRY_CAST((SPLIT_PART('20,24',
    ',',
    2)) AS DOUBLE) >= 0 THEN FLOOR(TRY_CAST((SPLIT_PART('20,24', ',', 2)) AS DOUBLE))
    ELSE CEIL(TRY_CAST((SPLIT_PART('20,24',
    ',',
    2)) AS DOUBLE))
END AS BIGINT))
    AND (trips_data.d_weekday >= 5)
    AND (trips_data.d_weekday <= 7)
    AND (trips_data.rider_age >= 18)
    AND (trips_data.rider_age < 22)
    AND (trips_data.last_point = 1))
GROUP BY
    1
ORDER BY
    2