As we have seen we have already explored the relational and non-relational databases in Part I.
Here is the link: Part I
Now, it’s time to delve deeper and see how we can use SQL to interact with and manipulate our data within relational databases.
Practice for Mastering SQL
Let’s examine a complex query that employs crucial SQL statements and break it down in order to understand it.
We will try to write a query to find the top account holders in New York who had the highest successful transactions, as well as their login hours and failed transactions.
Table 1: The first table ‘account_login_hours’ contains information about the login hours for each account number.
Table 2: The second table ‘transactions’ contains information about transactions that have been made.
Don’t worry. Let’s start by writing the query without focusing on the details for now.
SELECT bank.account_number, RANK() OVER ( ORDER BY transaction.total_successful_transactions DESC ), bank.login_hours, transaction.total_successful_transactions, transaction.total_failed_transactions FROM ( SELECT account_number, SUM(login_minutes) AS login_hours FROM account_login_hours WHERE year = 2020 AND login_date BETWEEN DATE_SUB( CURRENT_DATE(), 7) AND CURRENT_DATE() AND location = 'New York' GROUP BY account_number, login_date ) AS bank LEFT JOIN ( SELECT transaction_date, account_number, COUNT( DISTINCT CASE WHEN transaction_status = 'successful' THEN transaction_id END ) AS total_successful_transactions, COUNT ( DISTINCT CASE WHEN transaction_status = 'failed' THEN transaction_id END ) AS total_failed_transactions FROM transactions WHERE year = 2020 AND transaction_date BETWEEN DATE_SUB( CURRENT_DATE(), 7 ) AND CURRENT_DATE() AND location = 'New York' GROUP BY transaction_date, account_number ) AS transaction ON transaction.account_number = bank.account_number AND transaction.transaction_date = bank.login_date
The next step involves breaking the query into two chunks.
The first part of the query is a subquery that selects the account_number and sum of login_minutes (aliased as login_hours) from the account_login_hours table for a given year, date range, and location. This subquery is given an alias bank.
The second part of the query is another subquery that selects the transaction_date, account_number, and counts of distinct transaction_id values from the transactions table, grouped by transaction_date and account_number, and filtered by year, date range, and location. This subquery is given an alias transaction.
The outer query then performs a LEFT JOIN on the two subqueries by account_number and transaction_date, and selects several columns: bank.account_number, a ranking value (determined by RANK() OVER (ORDER BY transaction.total_successful_transactions DESC)), bank.login_hours, transaction.total_successful_transactions, and transaction.total_failed_transactions.
The ranking value is computed using a window function, which assigns a rank to each row within a result set based on the values in one or more columns. In this case, the ranking is based on the total_successful_transactions column in descending order.
Finally, the query selects the account_number, rank, login_hours, total_successful_transactions, and total_failed_transactions columns from the final result set.
Now, take a closer look at the functions used in the query.
- SELECT: This function is used to specify the columns that you want to include in the output of your query. In this query, SELECT is used to specify the columns that are being selected from the bank and transaction tables.
- RANK(): This function assigns a rank to each row within a result set, based on the values in one or more columns. The rows are ranked in descending order, with a rank of 1 assigned to the row with the highest value(s). In this query, RANK() is used to rank the rows based on the total_successful_transactions column..
- OVER: This function is used in conjunction with an aggregate function, such as RANK(), to specify the partition of the result set over which the ranking function will be applied. In this query, OVER (ORDER BY transaction.total_successful_transactions DESC) specifies that the ranking function should be applied over the entire result set and that the rows should be ranked in descending order based on the total_successful_transactions column.
- SUM(): This function returns the sum of a set of values.
- COUNT(): This function returns the number of rows in a result set that meets a specified condition. In this query, COUNT() is used to count the number of distinct successful and failed transactions for each account on each day.
- DATE_SUB(): This function subtracts a specified time interval from a date. In this query, DATE_SUB(CURRENT_DATE(), 7) is used to calculate the date 7 days before the current date.
- CURRENT_DATE(): This function returns the current date.
- LEFT JOIN: This type of join combines rows from two or more tables based on a related column between the tables. If a row from the left table does not have a matching row in the right table, NULL values will be displayed for the right table’s columns. In this query, a LEFT JOIN is used to combine rows from the bank and transaction tables based on the account_number and transaction_date columns.
- GROUP BY: This clause is used to group the results of a query by one or more columns. In this query, GROUP BY transaction_date, account_number groups the transaction data by the transaction_date and account_number columns.
- CASE: This function allows you to specify multiple conditions and return a different value for each condition. The syntax for a CASE statement is CASE WHEN condition THEN result ELSE result END. In this query, CASE is used to count only the distinct successful or failed transactions for each account on each day.
SQL is an essential tool for working with relational databases. By learning the various commands and syntax, you can efficiently manage and manipulate your data to suit your needs.
Check out few other blogs here :
An introduction to the central limit theorem and sampling techniques
Follow us on LinkedIn