Advanced SQL Techniques
Autorka: Esra Soylu
Link pierwotny: https://medium.com/@esrasoylu/advanced-sql-techniques-7016163019eb
Opublikowany: 23.11.2024
In this article, we will focus on advanced SQL techniques. I will try to explain the methods used in SQL techniques in detail, both conceptually and with practical examples. We will perform Advanced SQL work with PostgreSQL. Basic SQL commands will not be covered in the content. If you want to learn about basic SQL queries, you can review the examples here.
PostgreSQL popularity is increasing day by day. If you want to examine the current popularity of database management programs, you can use the website.[1]
1.Database Partitioning
In our tables containing a lot of data, when we want to work with smaller data, increase query performance and search in less data, we can separate a table into logical parts by partitioning. There are 4 types of partitioning that we can choose. Let’s examine the purposes of use of these partitioning types and make sample studies.
- List Partitioning: This is the method we can use when we want to divide the table into separate partitions according to the values in a certain column. It is used in cases where data can be classified such as product name, country, city.
- Range Partitioning: Used to divide the table according to specific dates or numbers.
- Hash Partitioning: When partitioning data, it can be used to distribute the values in a certain column to the partitions in a balanced but random way by passing them through a hash function. It is preferred if the data is wanted to be distributed equally.
- Composite Partitioning : Partitions created by using a combination of the above partitioning methods are called composite partitions.
While carrying out the study, we will create analytical SQL queries with PostreSQL using the Videogames table obtained from the Kaggle database.[3]
When creating a table, the partition column must be specified and the column must be the primary key.
Example 1: Creating table for list partition
CREATE TABLE video_games ( game_id SERIAL, game_name TEXT, Platform TEXT, year_of_release INT, Genre TEXT, Publisher TEXT, NA_Sales NUMERIC(10, 2), EU_Sales NUMERIC(10, 2), JP_Sales NUMERIC(10, 2), Other_Sales NUMERIC(10, 2), Global_Sales NUMERIC(10, 2),
PRIMARY KEY (game_id, Platform)
)
PARTITION BY LIST (Platform);
Creating a List Partition
CREATE TABLE videogames_ps2 PARTITION OF video_games FOR VALUES IN ('PS2');
CREATE TABLE videogames_ps3 PARTITION OF video_games
FOR VALUES IN (‘PS3’);
CREATE TABLE videogames_ps4 PARTITION OF video_games
FOR VALUES IN (‘PS4’);
— Can create a default partition for other platforms.
CREATE TABLE videogames_other PARTITION OF video_games
DEFAULT;
After creating the partition, we can add data. Let’s add external data to the video_games table in PostgreSql. Add data to each table you create with this code. In the examples below, this code is not added as a comment again.
COPY video_games (game_name, platform,year_of_release, genre, publisher, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales) FROM 'C:/Users/Esra SOYLU/Desktop/videogames.csv' DELIMITER ',' CSV HEADER NULL AS 'N/A';
Let’s call the videogames_ps2 table we created as a partition.
Example 2:Range Partition
When we want to create a range partition, PARTION Range (column_name) must be specified when creating the table. After creating the table, let’s create a range partition according to years.
CREATE TABLE video_games ( game_id SERIAL, game_name TEXT, Platform TEXT, year_of_release INT, Genre TEXT, Publisher TEXT, NA_Sales NUMERIC(10, 2), EU_Sales NUMERIC(10, 2), JP_Sales NUMERIC(10, 2), Other_Sales NUMERIC(10, 2), Global_Sales NUMERIC(10, 2), )
PARTITION BY RANGE (Year_of_release);
CREATE TABLE games_1980s PARTITION OF video_games
FOR VALUES FROM (1980) TO (2000);
CREATE TABLE games_2000s PARTITION OF video_games
FOR VALUES FROM (2000) TO (2010);
CREATE TABLE games_2010s PARTITION OF video_games
FOR VALUES FROM (2010) TO (2020);
CREATE TABLE games_2020s PARTITION OF video_games
FOR VALUES FROM (2020) TO (2025);
CREATE TABLE games_default PARTITION OF video_games
DEFAULT;
Let’s add the data to our table and analyze it. We called it with the partition select structure we created for the years 1980 -2000. We can now perform any SQL queries we want via this table.
Let’s create codes to create Hash partition and Compuse partition.
Example 3: Creating a Hash Partition
CREATE TABLE video_games ( game_id SERIAL, game_name TEXT, platform TEXT, year_of_release INT, genre TEXT, publisher TEXT, na_sales NUMERIC(10, 2), eu_sales NUMERIC(10, 2), jp_sales NUMERIC(10, 2), other_sales NUMERIC(10, 2), global_sales NUMERIC(10, 2), PRIMARY KEY (game_id, year_of_release) ) PARTITION BY HASH (year_of_release);
-- 1. partition CREATE TABLE video_games_p1 PARTITION OF video_games FOR VALUES WITH (MODULUS 4, REMAINDER 0);
— 2. partition
CREATE TABLE video_games_p2 PARTITION OF video_games
FOR VALUES WITH (MODULUS 4, REMAINDER 1);
— 3. partition
CREATE TABLE video_games_p3 PARTITION OF video_games
FOR VALUES WITH (MODULUS 4, REMAINDER 2);
— 4. partition
CREATE TABLE video_games_p4 PARTITION OF video_games
FOR VALUES WITH (MODULUS 4, REMAINDER 3);
Example 4: Creating a Compuse Partion
After creating a partition table by year, let’s apply partition by platform. Applying two partitions is called compuse partition.
CREATE TABLE video_games ( game_id SERIAL, game_name TEXT, platform TEXT, year_of_release INT, Genre TEXT, Publisher TEXT, NA_Sales NUMERIC(10, 2), EU_Sales NUMERIC(10, 2), JP_Sales NUMERIC(10, 2), Other_Sales NUMERIC(10, 2), Global_Sales NUMERIC(10, 2) )
PARTITION BY RANGE (year_of_release);
-- Partition for the years 1980-2000 CREATE TABLE video_games_1980s PARTITION OF video_games FOR VALUES FROM (1980) TO (2000) PARTITION BY LIST (platform);
— Partition for the years 2000-2010
CREATE TABLE video_games_2000s PARTITION OF video_games
FOR VALUES FROM (2000) TO (2010) PARTITION BY LIST (platform);
--List division by platform for the years 1980-2000 CREATE TABLE video_games_1980s_wii_ps2 PARTITION OF video_games_1980s FOR VALUES IN ('Wii', 'PS2');
–List division by platform for the years 2000-2010
CREATE TABLE video_games_2000s_ps3_xbox_pc PARTITION OF video_games_2000s
FOR VALUES IN (‘PS3’, ‘XBOX’, ‘PC’);
2) Windowing Functions
2.1.)Using Partition By with SQL Sorting Functions
Let’s consider the frequently used sorting functions with Partition By.
- Row Number: When sorting in a data set, the sorting continues regardless of whether they have equal values.
- Rank: When sorting a data set, if there are values of equal value, the data remains the same number as the ranking number. When a different value arrives, it is increased by the skipped value.
- Dense Rank: When sorting a data set, if there are values of equal value, the data remains the same number as the ranking number. When a different value arrives, the next number is passed in order.
Example: Let’s assign a ranking to the games in each group based on sales amount. Games with equal global_sales will be in the same rank and will be skipped to the next rank. For this we will need to use the RANK() function. You can examine the differences by using the row_number() and dense_rank() functions instead of the RANK() function.
SELECT game_name, platform, year_of_release, global_sales, RANK() OVER (PARTITION BY platform ORDER BY global_sales DESC) AS sales_rank FROM video_games;
2.2.)Value Functions
- LAG: This function, used with time data, is a function that accesses a previous row from another column and allows you to create a new column with this accessed information.
Syntax: LAG (column1_name) OVER (ORDER BY column2_name) AS newcolumn_name
- Lead: It is a function that allows you to create a new column by taking the value from the next row. For example, if we want to create a column about next month’s sales in the list of monthly sales, the Lead function can be used.
Syntax: LAG (column1_name) OVER (ORDER BY column2_name) AS newcolumn_name
SELECT game_name, platform, year_of_release, global_sales, LAG(global_sales, 1) OVER (ORDER BY year_of_release) AS previous_sales, LEAD(global_sales, 1) OVER (ORDER BY year_of_release) AS next_sales FROM video_games;
- First Value: If we want to get the first value from an ordered list, we can use the first_value function. For example, this function can be used when we want to find the first year of game sales.
- Last Value:When we want to get the last value from a sorted column, we can use the last_value function. For example, we can use this function when we want to find the last year of game sales.
SELECT game_name, platform, year_of_release, global_sales, FIRST_VALUE(year_of_release) OVER (PARTITION BY platform ORDER BY year_of_release) AS first_sales, LAST_VALUE(year_of_release) OVER (PARTITION BY platform ORDER BY year_of_release RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_sales FROM video_games;
3)Case Statements
In SQL, the `CASE` statement is used to create conditional expressions to check, return values, or generate new columns through logical validation. In each `CASE` expression, the condition is specified using `WHEN`, and `END` is used to terminate the conditions.
Syntax:
CASE case_value
WHEN condition THEN result1
WHEN condition THEN result2
…
If not the result
END CASE;
SELECT game_name, Platform, NA_Sales, CASE WHEN NA_Sales > 5 THEN 'High' WHEN NA_Sales BETWEEN 2 AND 5 THEN 'Medium' ELSE 'Low' END AS Sales_Level FROM video_games;
Let’s write an SQL query that shows the total sales of the platforms by year.
SELECT platform, SUM(CASE WHEN year_of_release = 2005 THEN NA_Sales ELSE 0 END) AS sales_2005, SUM(CASE WHEN year_of_release = 2006 THEN NA_Sales ELSE 0 END) AS sales_2006, SUM(CASE WHEN year_of_release = 2007 THEN NA_Sales ELSE 0 END) AS sales_2007 FROM video_games GROUP BY platform;
4)Common Table Expressions (CTEs)
CTEs are virtual tables that represent a temporarily named result set. CTEs are defined using the WITH keyword and allow you to create a named, reusable subquery within your SQL statement. There are two types of CTE Structures.
- Simple (Non-Recursive) CTE:These are non-cyclical CTEs. It is used for one-time purposes of preparing and editing data or making a subquery more readable.
Syntax:WITH cte_name AS (SELECT columns FROM table_name WHERE conditions
)
SELECT columns FROM cte_name WHERE conditions;
2.Recursive (Özyinelemeli) CTE: It uses a repetitive structure, starting from a starting point query. It is mostly used in hierarchical data structures or for queries that continue with a sequential structure.
Syntax(Recursive CTE): WITH RECURSIVE cte_name AS (
Anchor Query SELECT columns FROM table_name WHERE conditions
UNION ALL
Recursive Query SELECT columns FROM tablo_name INNER JOIN cte_name ON conditions
)
SELECT * FROM cte_name;
- Anchor Query: Defines the starting point of the recursive query. It runs once and adds the result to the CTE.
- Recursive Query: Runs the recursive query using the results obtained from the previous step.
- UNION ALL: Combines anchor and recursive queries.
Let’s write a Recursive CTE query to find all the ancestors of a person named Frank.
CREATE TABLE parent_of ( parent VARCHAR(50), child VARCHAR(50) );
INSERT INTO parent_of (parent, child) VALUES
(‘Alice’, ‘Carol’),
(‘Bob’, ‘Carol’),
(‘Carol’, ‘George’),
(‘Dave’, ‘Mery’),
(‘Eve’, ‘Mary’),
(‘Mary’, ‘Frank’)
WITH RECURSIVE Ancestor AS (SELECT parent AS p FROM parent_of WHERE child='Frank' UNION ALL SELECT parent FROM Ancestor, parent_of WHERE Ancestor.p = ParentOf.child)
SELECT * FROM Ancestor
5)EXPLAIN: Viewing the Query Plan
While Explain developers run SQL queries, we can provide us with information before or during the execution process of a query, and we can analyze the SQL query we will execute based on the results we obtain from these outputs. Depending on the complexity of the query, it provides information on topics such as the join strategy, the method of extracting data from the tables, and the estimated rows used to execute the query.
Syntax: EXPLAIN sorgu ifadesi
EXPLAIN Select * from video_games where game_id between 1 and 5 order by game_name limit 3;
The EXPLAIN ANALYZE command allows us to obtain information about the time spent executing the query, sorting and joining that cannot be done in memory, and more.
6)Index
When we work with a large database and need to find a piece of data, the machine will search every row until it finds it to get the data we want from the database. If the data you are looking for is towards the end, this query will take a long time to run. An index structure is used to eliminate this problem. Thus, Index data query speed is increased. Index tipleri
Syntax: CREATE INDEX index_name ON table_name (column_name1, column_name2);
CREATE INDEX idx_name ON employees (first_name, last_name);
SELECT * FROM employees WHERE first_name= 'ESRA';
7)Triggers
When we run a query in SQL, it is the code structure that enables another query to run. This code structure is used to perform a specific operation as a result of a change made to the database.
Syntax:
CREATE TRIGGER trigger_name {BEFORE | AFTER}
{INSERT | UPDATE | DELETE}
ON table_name [FOR EACH ROW]
BEGIN
Trigger action commands
END;
- trigger_name: The name of the trigger to be created.
- BEFORE | AFTER: Used to specify whether the trigger will run before or after the data change operation.
- INSERT | UPDATE | DELETE: It is specified for which data change operation the trigger will run.
- ON table_name: Specifies which table the trigger will run on.
- FOR EACH ROW: Used to ensure that the trigger runs for each row.
- BEGIN … END: SQL commands to be executed within the trigger are included in this range.
CREATE OR REPLACE FUNCTION set_created_at() RETURNS TRIGGER AS $$ BEGIN
NEW.created_at := NOW();
RETURN NEW;
END;
CREATE TRIGGER before_insert_videogame
BEFORE INSERT ON video_games
FOR EACH ROW
EXECUTE FUNCTION set_created_at();
INSERT INTO video_games (game_name, platform, year_of_release, genre, publisher, na_sales, eu_sales, jp_sales, other_sales, global_sales) VALUES ('Galactic Conquest', 'PS5', 2024, 'Sci-Fi RPG', 'FutureTech Games', 1.20, 0.95, 0.30, 0.10, 2.55);
8)Temporary Tables
Temporary tables are tables that exist only for a certain transaction or session duration and are automatically deleted when the session ends. Temporary tables are often used to store temporary data, perform intermediate calculation operations, or hold data temporarily between database operations.
Syntax: CREATE TEMPORARY TABLE table_name(column1 type,column2 type, … );
CREATE TEMPORARY TABLE temp_employees ( employee_id INT, first_name VARCHAR(50), last_name VARCHAR(50) );
Since we created a temporary table named temp_employees with Create Tempory, it does not appear in the tables area.
9)Stored Procedures
Stored Procedures is a SQL query group that allows you to use SQL queries without rewriting them.It does not return a result. It performs operations such as updating, deleting or adding to the table.
Syntax: CREATE PROCEDURE procedure_name [ (parameters) ]
AS
BEGIN
SQL commands
END;
CREATE OR REPLACE PROCEDURE insert_game_data( game_name TEXT, platform_name TEXT, release_year INT, game_genre TEXT, game_publisher TEXT, na_sales NUMERIC, eu_sales NUMERIC, jp_sales NUMERIC, other_sales NUMERIC, global_sales NUMERIC ) LANGUAGE SQL BEGIN ATOMIC INSERT INTO video_games ( game_name, platform, year_of_release, genre, publisher, na_sales, eu_sales, jp_sales, other_sales, global_sales ) VALUES ( game_name, platform_name, release_year, game_genre, game_publisher, na_sales, eu_sales, jp_sales, other_sales, global_sales ); END;
Calling Procedure
CALL insert_game_data( 'The Legend of Zelda', 'Nintendo Switch', 2023, 'Adventure', 'Nintendo', 5.0, 3.2, 2.1, 0.5, 10.8 );
10)VIEW
View creation in PostgreSQL simplifies complex queries by collecting the desired SQL tables in a virtual table like a real table. We can create it more simply by using only VIEW structures instead of JOIN structures.
Syntax: CREATE VIEW view_name AS query;
Let’s assume that we have a table named employees, departments and sales in this database and we want to create a view that combines employee details and sales information. We can create a code like the one below.
CREATE VIEW sales_team_info AS SELECT employees.employee_id, employees.first_name, employees.last_name, employees.email, employees.phone_number, departments.department_name, sales.total_sales, sales.region FROM employees INNER JOIN departments ON employees.department_id = departments.department_id INNER JOIN sales ON employees.employee_id = sales.employee_id
WHERE departments.department_name=’Sales’;
11)User Defined Functions(UDFs)
User Defined Functions (UDFs) allow you to extend the built-in functionality of SQL by creating custom functions.
CREATE OR REPLACE FUNCTION calculate_age(date_of_birth DATE) RETURNS INTEGER AS $$ BEGIN RETURN DATE_PART('year', AGE(date_of_birth)); END; $$ LANGUAGE plpgsql;
CREATE TABLE users ( user_id SERIAL PRIMARY KEY, first_name VARCHAR(50), last_name VARCHAR(50), date_of_birth DATE );
Let’s add data.
INSERT INTO users (first_name, last_name, date_of_birth) VALUES ('James', 'Johnson', '1990-05-15'), ('Bob', 'Smith', '1985-10-20'), ('Charlie', 'Brown', '2000-07-25');
Let’s view the table. We can see that the values in the Age column are automatically created by the function.
12)SUBQUERY
A subquery is a query nested within another query. A subquery is also known as an inner query or nested query. IN and EXISTS are two important functions that are commonly used in SQL subqueries.
IN returns the values in the queried data, while the EXISTS structure checks whether the desired data exists. If the subquery returns at least one row, it returns TRUE, otherwise it returns FALSE.
Syntax: SELECT column1, column2,.. FROM table_name
WHERE column_name IN (subquery);
SELECT column1, column2, …FROM table_name
WHERE EXISTS (subquery);
Example Code:
SELECT * FROM video_games WHERE game_id IN (SELECT game_id FROM video_games WHERE publisher = 'Nintendo');
SELECT * FROM video_games g1 WHERE EXISTS ( SELECT 1 FROM video_games g2 WHERE g2.publisher = 'Nintendo' AND g2.game_id = g1.game_id ) AND g1.year_of_release < 2020;
13)Working with JSON Data
JSON (JavaScript Object Notation) is a flexible data format that especially supports working with semi-structured data. Since data in JSON format is in a key-value pair structure, it offers an ideal solution for data transfer between web-based applications. You can click here to get detailed information about JSON operators and functions. Let’s do some work on JSON functions.
Creating JSON data
CREATE TABLE orders ( id SERIAL PRIMARY KEY, customer_name TEXT, order_details JSON );
INSERT INTO orders (customer_name, order_details)
VALUES
(‘Esra’, ‘{“item”: “Laptop”, “quantity”: 1, “price”: 45000}’),
(‘Esma’, ‘{“item”: “Phone”, “quantity”: 2, “price”: 25000}’);
JSON Operators
Assuming we want to retrieve each customer’s name from the orders table along with the item and price details stored in the JSON-formatted order_details column.
Let’s query the Orders table to retrieve records where the item value in the JSON-formatted order_details column is ‘Laptop’.
SELECT * FROM orders WHERE order_details->>'item' = 'Laptop';
You can click here to learn about JSON functions. Let’s work with a json function example.
row_to_json: This function takes a database row as input and returns a single JSON object containing all columns in the row.
SELECT row_to_json(games) FROM games WHERE gameid between 1 and 5;
14)MVCC (Multi-Version Concurrency Control)
PostgreSQL uses Multiversion Concurrency Control to control access to data in a database. This control system allows multiple transactions to occur in the database at the same time and ensures consistency in the data. MVCC was developed to enable simultaneous read and write operations to be performed consistently and without interfering with each other.
-- Begin the transaction BEGIN;
— Insert a new game record
INSERT INTO games (name, genre, release_of_date, price)
VALUES (‘The Last of Us Part II’, ‘Action’, ‘2020-06-19’, 59.99);
— Update the price of an existing game
UPDATE games
SET price = 49.99
WHERE title = ‘The Last of Us Part II’;
— Intentional mistake to simulate failure
DELETE FROM games WHERE id = 99999;
— If all operations succeed, commit the transaction
COMMIT;
— If something goes wrong, roll back the transaction
ROLLBACK;
15)PostgreSQL Performans Tuning
There are functional features in PostgreSQL to make various settings and configurations to improve database performance. With PostgreSQL, performance tuning operations are applied for purposes such as optimizing resource management, improving query response times, and reducing input/output load.
Resource Management: There are many parameters related to resource management. You can click here to get detailed information about these. Let’s look at a few resource management functions.
- shared_buffers:Sets the amount of memory the database server uses for shared memory buffers.
- work_mem: Used to specify the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files.
- vacuum_cost_delay:Specifies the time in milliseconds after which the process will go to sleep when the cost limit is exceeded.
You can get detailed information about the parameters used in the utility process from the link below.
16)SQL Functions Guide
You can check out the guide I wrote about SQL functions before here.
RESOURCE
- DB-Engines ranking. https://db-engines.com/en/ranking
- Khan, W., Zhang, C., Luo, B., Kumar, T., & Ahmed, E. (2021, November). Robust Partitioning Scheme for Accelerating SQL Database. In 2021 IEEE International Conference on Emergency Science and Information Technology (ICESIT) (pp. 369–376). IEEE.
- Migeruj. (2017). Video Games Predictive Model [Data set]. Kaggle. https://www.kaggle.com/datasets/migeruj/videogames-predictive-model
- Mannino, M. (2022). Database Design, Query Formulation, and Administration: Using Oracle and PostgreSQL. Sage Publications.
- PostgreSQL Tutorial. https://www.postgresql.org/docs/current/
- PostgreSQL Performance Tuning. https://sematext.com/blog/postgresql-performance-tuning/
- PostgreSQL Performance Tuning. https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server