When you have your dataset loaded, you can start constructing some queries to discover what your data tells you. In this section, you learn how to write queries that answer these questions:
- How many rides take place each day?
- What is the average fare amount?
- How many rides of each rate type were taken?
- What kind of trips are going to and from airports?
- How many rides took place on New Year's Day 2016?
This dataset contains ride data for January 2016. To find out how many rides
took place each day, you can use a SELECT
statement. In this case, you want to
count the total number of rides each day, and show them in a list by date.
Connect to the Timescale database that contains the NYC taxi dataset.
At the psql prompt, use this query to select all rides taken in the first week of January 2016, and return a count of rides for each day:
SELECT date_trunc('day', pickup_datetime) as day,COUNT(*) FROM ridesWHERE pickup_datetime < '2016-01-08'GROUP BY dayORDER BY day;The result of the query looks like this:
day | count---------------------+--------2016-01-01 00:00:00 | 3450372016-01-02 00:00:00 | 3128312016-01-03 00:00:00 | 3028782016-01-04 00:00:00 | 3161712016-01-05 00:00:00 | 3432512016-01-06 00:00:00 | 3485162016-01-07 00:00:00 | 364894
You can include a function in your SELECT
query to determine the average fare
paid by each passenger.
Connect to the Timescale database that contains the NYC taxi dataset.
At the psql prompt, use this query to select all rides taken in the first week of January 2016, and return the average fare paid on each day:
SELECT date_trunc('day', pickup_datetime)AS day, avg(fare_amount)FROM ridesWHERE pickup_datetime < '2016-01-08'GROUP BY dayORDER BY day;The result of the query looks like this:
day | avg---------------------+---------------------2016-01-01 00:00:00 | 12.85693250289099432016-01-02 00:00:00 | 12.43447135993555632016-01-03 00:00:00 | 13.06159004615719862016-01-04 00:00:00 | 12.20729273083236602016-01-05 00:00:00 | 12.00186708851540132016-01-06 00:00:00 | 12.00023290178930092016-01-07 00:00:00 | 12.1234180337303436
Taxis in New York City use a range of different rate types for different kinds
of trips. For example, trips to the airport are charged at a flat rate from any
location within the city. This section shows you how to construct a query that
shows you the nuber of trips taken for each different fare type. It also uses a
JOIN
statement to present the data in a more informative way.
Connect to the Timescale database that contains the NYC taxi dataset.
At the psql prompt, use this query to select all rides taken in the first week of January 2016, and return the total number of trips taken for each rate code:
SELECT rate_code, COUNT(vendor_id) AS num_tripsFROM ridesWHERE pickup_datetime < '2016-01-08'GROUP BY rate_codeORDER BY rate_code;The result of the query looks like this:
rate_code | num_trips-----------+-----------1 | 22664012 | 548323 | 41264 | 9675 | 71936 | 1799 | 42
This output is correct, but it's not very easy to read, because you probably
don't know what the different rate codes mean. However, the rates
table in the
dataset contains a human-readable description of each code. You can use a JOIN
statement in your query to connect the rides
and rates
tables, and present
information from both in your results.
Connect to the Timescale database that contains the NYC taxi dataset.
At the psql prompt, copy this query to select all rides taken in the first week of January 2016, join the
rides
andrates
tables, and return the total number of trips taken for each rate code, with a description of the rate code:SELECT rates.description, COUNT(vendor_id) AS num_tripsFROM ridesJOIN rates ON rides.rate_code = rates.rate_codeWHERE pickup_datetime < '2016-01-08'GROUP BY rates.descriptionORDER BY LOWER(rates.description);The result of the query looks like this:
description | num_trips-----------------------+-----------group ride | 17JFK | 54832Nassau or Westchester | 967negotiated fare | 7193Newark | 4126standard rate | 2266401
There are two primary airports in the dataset: John F. Kennedy airport, or JFK, is represented by rate code 2; Newark airport, or EWR, is represented by rate code 3.
Information about the trips that are going to and from the two airports is useful for city planning, as well as for organizations like the NYC Tourism Bureau.
This section shows you how to construct a query that returns trip information for trips going only to the new main airports.
Connect to the Timescale database that contains the NYC taxi dataset.
At the psql prompt, use this query to select all rides taken to and from JFK and Newark airports, in the first week of January 2016, and return the number of trips to that airport, the average trip duration, average trip cost, and average number of passengers:
SELECT rates.description,COUNT(vendor_id) AS num_trips,AVG(dropoff_datetime - pickup_datetime) AS avg_trip_duration,AVG(total_amount) AS avg_total,AVG(passenger_count) AS avg_passengersFROM ridesJOIN rates ON rides.rate_code = rates.rate_codeWHERE rides.rate_code IN (2,3) AND pickup_datetime < '2016-01-08'GROUP BY rates.descriptionORDER BY rates.description;The result of the query looks like this:
description | num_trips | avg_trip_duration | avg_total | avg_passengers-------------+-----------+-------------------+---------------------+--------------------JFK | 54832 | 00:46:44.614222 | 63.7791311642836300 | 1.8062080536912752Newark | 4126 | 00:34:45.575618 | 84.3841783809985458 | 1.8979641299079011
New York City is famous for the Ball Drop New Year's Eve celebration in Times Square. Thousands of people gather to bring in the New Year and then head out into the city: to their favorite bar, to gather with friends for a meal, or back home. This section shows you how to construct a query that returns the number of taxi trips taken on 1 January, 2016, in 30 minute intervals.
In PostgreSQL, it's not particularly easy to segment the data by 30 minute time
intervals. To do this, you would need to use a TRUNC
function to calculate the
quotient of the minute that a ride began in divided by 30, then truncate the
result to take the floor of that quotient. When you had that result, you could
multiply the truncated quotient by 30.
In your Timescale database, you can use the time_bucket
function to segment
the data into time intervals instead.
Connect to the Timescale database that contains the NYC taxi dataset.
At the psql prompt, use this query to select all rides taken on the first day of January 2016, and return a count of rides for each 30 minute interval:
SELECT time_bucket('30 minute', pickup_datetime) AS thirty_min, count(*)FROM ridesWHERE pickup_datetime < '2016-01-02 00:00'GROUP BY thirty_minORDER BY thirty_min;The result of the query starts like this:
thirty_min | count---------------------+-------2016-01-01 00:00:00 | 109202016-01-01 00:30:00 | 143502016-01-01 01:00:00 | 146602016-01-01 01:30:00 | 138512016-01-01 02:00:00 | 132602016-01-01 02:30:00 | 122302016-01-01 03:00:00 | 11362
Keywords
Found an issue on this page?Report an issue or Edit this page in GitHub.