Query the data

When you have your dataset loaded, you can start constructing some queries to discover what your data tells you. In this section, you learn how to write queries that answer these questions:

How many rides take place every day?

This dataset contains ride data for January 2016. To find out how many rides took place each day, you can use a SELECT statement. In this case, you want to count the total number of rides each day, and show them in a list by date.

Finding how many rides take place every day

Connect to the Timescale database that contains the NYC taxi dataset.

At the psql prompt, use this query to select all rides taken in the first week of January 2016, and return a count of rides for each day:


SELECT date_trunc('day', pickup_datetime) as day,
COUNT(*) FROM rides
WHERE pickup_datetime < '2016-01-08'
GROUP BY day
ORDER BY day;

The result of the query looks like this:


day         | count
---------------------+--------
 2016-01-01 00:00:00 | 345037
 2016-01-02 00:00:00 | 312831
 2016-01-03 00:00:00 | 302878
 2016-01-04 00:00:00 | 316171
 2016-01-05 00:00:00 | 343251
 2016-01-06 00:00:00 | 348516
 2016-01-07 00:00:00 | 364894

What is the average fare amount?

You can include a function in your SELECT query to determine the average fare paid by each passenger.

Finding the average fare amount

Connect to the Timescale database that contains the NYC taxi dataset.

At the psql prompt, use this query to select all rides taken in the first week of January 2016, and return the average fare paid on each day:


SELECT date_trunc('day', pickup_datetime)
AS day, avg(fare_amount)
FROM rides
WHERE pickup_datetime < '2016-01-08'
GROUP BY day
ORDER BY day;

The result of the query looks like this:


day         |         avg
---------------------+---------------------
 2016-01-01 00:00:00 | 12.8569325028909943
 2016-01-02 00:00:00 | 12.4344713599355563
 2016-01-03 00:00:00 | 13.0615900461571986
 2016-01-04 00:00:00 | 12.2072927308323660
 2016-01-05 00:00:00 | 12.0018670885154013
 2016-01-06 00:00:00 | 12.0002329017893009
 2016-01-07 00:00:00 | 12.1234180337303436

How many rides of each rate type were taken?

Taxis in New York City use a range of different rate types for different kinds of trips. For example, trips to the airport are charged at a flat rate from any location within the city. This section shows you how to construct a query that shows you the nuber of trips taken for each different fare type. It also uses a JOIN statement to present the data in a more informative way.

Finding the number of rides for each fare type

Connect to the Timescale database that contains the NYC taxi dataset.

At the psql prompt, use this query to select all rides taken in the first week of January 2016, and return the total number of trips taken for each rate code:


SELECT rate_code, COUNT(vendor_id) AS num_trips
FROM rides
WHERE pickup_datetime < '2016-01-08'
GROUP BY rate_code
ORDER BY rate_code;

The result of the query looks like this:


rate_code | num_trips
-----------+-----------
         1 |   2266401
         2 |     54832
         3 |      4126
         4 |       967
         5 |      7193
         6 |        17
        99 |        42

This output is correct, but it's not very easy to read, because you probably don't know what the different rate codes mean. However, the rates table in the dataset contains a human-readable description of each code. You can use a JOIN statement in your query to connect the rides and rates tables, and present information from both in your results.

Displaying the number of rides for each fare type

Connect to the Timescale database that contains the NYC taxi dataset.

At the psql prompt, copy this query to select all rides taken in the first week of January 2016, join the rides and rates tables, and return the total number of trips taken for each rate code, with a description of the rate code:


SELECT rates.description, COUNT(vendor_id) AS num_trips
FROM rides
JOIN rates ON rides.rate_code = rates.rate_code
WHERE pickup_datetime < '2016-01-08'
GROUP BY rates.description
ORDER BY LOWER(rates.description);

The result of the query looks like this:


description      | num_trips
-----------------------+-----------
 group ride            |        17
 JFK                   |     54832
 Nassau or Westchester |       967
 negotiated fare       |      7193
 Newark                |      4126
 standard rate         |   2266401

What kind of trips are going to and from airports

There are two primary airports in the dataset: John F. Kennedy airport, or JFK, is represented by rate code 2; Newark airport, or EWR, is represented by rate code 3.

Information about the trips that are going to and from the two airports is useful for city planning, as well as for organizations like the NYC Tourism Bureau.

This section shows you how to construct a query that returns trip information for trips going only to the new main airports.

Finding what kind of trips are going to and from airports

Connect to the Timescale database that contains the NYC taxi dataset.

At the psql prompt, use this query to select all rides taken to and from JFK and Newark airports, in the first week of January 2016, and return the number of trips to that airport, the average trip duration, average trip cost, and average number of passengers:


SELECT rates.description,
    COUNT(vendor_id) AS num_trips,
    AVG(dropoff_datetime - pickup_datetime) AS avg_trip_duration,
    AVG(total_amount) AS avg_total,
    AVG(passenger_count) AS avg_passengers
FROM rides
JOIN rates ON rides.rate_code = rates.rate_code
WHERE rides.rate_code IN (2,3) AND pickup_datetime < '2016-01-08'
GROUP BY rates.description
ORDER BY rates.description;

The result of the query looks like this:


description | num_trips | avg_trip_duration |      avg_total      |   avg_passengers
-------------+-----------+-------------------+---------------------+--------------------
 JFK         |     54832 | 00:46:44.614222   | 63.7791311642836300 | 1.8062080536912752
 Newark      |      4126 | 00:34:45.575618   | 84.3841783809985458 | 1.8979641299079011

How many rides took place on New Year's Day 2016?

New York City is famous for the Ball Drop New Year's Eve celebration in Times Square. Thousands of people gather to bring in the New Year and then head out into the city: to their favorite bar, to gather with friends for a meal, or back home. This section shows you how to construct a query that returns the number of taxi trips taken on 1 January, 2016, in 30 minute intervals.

In PotsgreSQL, it's not particularly easy to segment the data by 30 minute time intervals. To do this, you would need to use a TRUNC function to calculate the quotient of the minute that a ride began in divided by 30, then truncate the result to take the floor of that quotient. When you had that result, you could multiply the truncated quotient by 30.

In your Timescale database, you can use the time_bucket function to segment the data into time intervals instead.

Finding how many rides took place on New Year's Day 2016

Connect to the Timescale database that contains the NYC taxi dataset.

At the psql prompt, use this query to select all rides taken on the first day of January 2016, and return a count of rides for each 30 minute interval:


SELECT time_bucket('30 minute', pickup_datetime) AS thirty_min, count(*)
FROM rides
WHERE pickup_datetime < '2016-01-02 00:00'
GROUP BY thirty_min
ORDER BY thirty_min;

The result of the query starts like this:


thirty_min      | count
---------------------+-------
 2016-01-01 00:00:00 | 10920
 2016-01-01 00:30:00 | 14350
 2016-01-01 01:00:00 | 14660
 2016-01-01 01:30:00 | 13851
 2016-01-01 02:00:00 | 13260
 2016-01-01 02:30:00 | 12230
 2016-01-01 03:00:00 | 11362