Zaawansowane Techniki SQL

Advanced SQL Techniques

Autorka: Esra Soylu

Link pierwotny: https://medium.com/@esrasoylu/advanced-sql-techniques-7016163019eb

Opublikowany: 23.11.2024

 

In this article, we will focus on advanced SQL techniques. I will try to explain the methods used in SQL techniques in detail, both conceptually and with practical examples. We will perform Advanced SQL work with PostgreSQL. Basic SQL commands will not be covered in the content. If you want to learn about basic SQL queries, you can review the examples here.

PostgreSQL popularity is increasing day by day. If you want to examine the current popularity of database management programs, you can use the website.[1]

1.Database Partitioning

In our tables containing a lot of data, when we want to work with smaller data, increase query performance and search in less data, we can separate a table into logical parts by partitioning. There are 4 types of partitioning that we can choose. Let’s examine the purposes of use of these partitioning types and make sample studies.

  • List Partitioning: This is the method we can use when we want to divide the table into separate partitions according to the values ​​in a certain column. It is used in cases where data can be classified such as product name, country, city.
  • Range Partitioning: Used to divide the table according to specific dates or numbers.
  • Hash Partitioning: When partitioning data, it can be used to distribute the values ​​in a certain column to the partitions in a balanced but random way by passing them through a hash function. It is preferred if the data is wanted to be distributed equally.
  • Composite Partitioning : Partitions created by using a combination of the above partitioning methods are called composite partitions.
Robust Partitioning Scheme for Accelerating SQL Database [2]
Robust Partitioning Scheme for Accelerating SQL Database

While carrying out the study, we will create analytical SQL queries with PostreSQL using the Videogames table obtained from the Kaggle database.[3]

When creating a table, the partition column must be specified and the column must be the primary key.

Example 1: Creating table for list partition

CREATE TABLE video_games ( 
    game_id SERIAL, 
    game_name TEXT, 
    Platform TEXT, 
    year_of_release INT, 
    Genre TEXT, 
    Publisher TEXT, 
    NA_Sales NUMERIC(10, 2), 
    EU_Sales NUMERIC(10, 2), 
    JP_Sales NUMERIC(10, 2), 
    Other_Sales NUMERIC(10, 2), 
    Global_Sales NUMERIC(10, 2),

PRIMARY KEY (game_id, Platform)

)
PARTITION BY LIST (Platform);

Creating a List Partition

CREATE TABLE videogames_ps2 PARTITION OF video_games 
    FOR VALUES IN ('PS2');

CREATE TABLE videogames_ps3 PARTITION OF video_games
FOR VALUES IN (‘PS3’);

CREATE TABLE videogames_ps4 PARTITION OF video_games
FOR VALUES IN (‘PS4’);

— Can create a default partition for other platforms.
CREATE TABLE videogames_other PARTITION OF video_games
DEFAULT;

After creating the partition, we can add data. Let’s add external data to the video_games table in PostgreSql. Add data to each table you create with this code. In the examples below, this code is not added as a comment again.

COPY video_games (game_name, platform,year_of_release, genre, publisher, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales) 
FROM 'C:/Users/Esra SOYLU/Desktop/videogames.csv' 
DELIMITER ',' 
CSV HEADER 
NULL AS 'N/A';

Let’s call the videogames_ps2 table we created as a partition.

Example 2:Range Partition

When we want to create a range partition, PARTION Range (column_name) must be specified when creating the table. After creating the table, let’s create a range partition according to years.

CREATE TABLE video_games ( 
    game_id SERIAL, 
    game_name TEXT, 
    Platform TEXT, 
    year_of_release INT, 
    Genre TEXT, 
    Publisher TEXT, 
    NA_Sales NUMERIC(10, 2), 
    EU_Sales NUMERIC(10, 2), 
    JP_Sales NUMERIC(10, 2), 
    Other_Sales NUMERIC(10, 2), 
    Global_Sales NUMERIC(10, 2), 
)

PARTITION BY RANGE (Year_of_release);

CREATE TABLE games_1980s PARTITION OF video_games
FOR VALUES FROM (1980) TO (2000);

CREATE TABLE games_2000s PARTITION OF video_games
FOR VALUES FROM (2000) TO (2010);

CREATE TABLE games_2010s PARTITION OF video_games
FOR VALUES FROM (2010) TO (2020);

CREATE TABLE games_2020s PARTITION OF video_games
FOR VALUES FROM (2020) TO (2025);

CREATE TABLE games_default PARTITION OF video_games
DEFAULT;

Let’s add the data to our table and analyze it. We called it with the partition select structure we created for the years 1980 -2000. We can now perform any SQL queries we want via this table.

Let’s create codes to create Hash partition and Compuse partition.

Example 3: Creating a Hash Partition

CREATE TABLE video_games ( 
    game_id SERIAL, 
    game_name TEXT, 
    platform TEXT, 
    year_of_release INT, 
    genre TEXT, 
    publisher TEXT, 
    na_sales NUMERIC(10, 2), 
    eu_sales NUMERIC(10, 2), 
    jp_sales NUMERIC(10, 2), 
    other_sales NUMERIC(10, 2), 
    global_sales NUMERIC(10, 2), 
    PRIMARY KEY (game_id, year_of_release) 
)  
PARTITION BY HASH (year_of_release);
-- 1. partition 
CREATE TABLE video_games_p1 PARTITION OF video_games 
FOR VALUES WITH (MODULUS 4, REMAINDER 0);

— 2. partition
CREATE TABLE video_games_p2 PARTITION OF video_games
FOR VALUES WITH (MODULUS 4, REMAINDER 1);

— 3. partition
CREATE TABLE video_games_p3 PARTITION OF video_games
FOR VALUES WITH (MODULUS 4, REMAINDER 2);

— 4. partition
CREATE TABLE video_games_p4 PARTITION OF video_games
FOR VALUES WITH (MODULUS 4, REMAINDER 3);

Example 4: Creating a Compuse Partion

After creating a partition table by year, let’s apply partition by platform. Applying two partitions is called compuse partition.

CREATE TABLE video_games ( 
    game_id SERIAL, 
    game_name TEXT, 
    platform TEXT, 
    year_of_release INT, 
    Genre TEXT, 
    Publisher TEXT, 
    NA_Sales NUMERIC(10, 2), 
    EU_Sales NUMERIC(10, 2), 
    JP_Sales NUMERIC(10, 2), 
    Other_Sales NUMERIC(10, 2), 
    Global_Sales NUMERIC(10, 2) 
)

PARTITION BY RANGE (year_of_release);

-- Partition for the years 1980-2000 
CREATE TABLE video_games_1980s PARTITION OF video_games 
    FOR VALUES FROM (1980) TO (2000) PARTITION BY LIST (platform);

— Partition for the years 2000-2010
CREATE TABLE video_games_2000s PARTITION OF video_games
FOR VALUES FROM (2000) TO (2010) PARTITION BY LIST (platform);

--List division by platform for the years 1980-2000 
CREATE TABLE video_games_1980s_wii_ps2 PARTITION OF video_games_1980s 
    FOR VALUES IN ('Wii', 'PS2');

–List division by platform for the years 2000-2010
CREATE TABLE video_games_2000s_ps3_xbox_pc PARTITION OF video_games_2000s
FOR VALUES IN (‘PS3’, ‘XBOX’, ‘PC’);

2) Windowing Functions

2.1.)Using Partition By with SQL Sorting Functions

Let’s consider the frequently used sorting functions with Partition By.

  • Row Number: When sorting in a data set, the sorting continues regardless of whether they have equal values.
  • Rank: When sorting a data set, if there are values ​​of equal value, the data remains the same number as the ranking number. When a different value arrives, it is increased by the skipped value.
  • Dense Rank: When sorting a data set, if there are values ​​of equal value, the data remains the same number as the ranking number. When a different value arrives, the next number is passed in order.

Example: Let’s assign a ranking to the games in each group based on sales amount. Games with equal global_sales will be in the same rank and will be skipped to the next rank. For this we will need to use the RANK() function. You can examine the differences by using the row_number() and dense_rank() functions instead of the RANK() function.

SELECT  
    game_name, 
    platform, 
    year_of_release, 
    global_sales, 
    RANK() OVER (PARTITION BY platform ORDER BY global_sales DESC) AS sales_rank 
FROM  
    video_games;

2.2.)Value Functions

  • LAG: This function, used with time data, is a function that accesses a previous row from another column and allows you to create a new column with this accessed information.

Syntax: LAG (column1_name) OVER (ORDER BY column2_name) AS newcolumn_name

  • Lead: It is a function that allows you to create a new column by taking the value from the next row. For example, if we want to create a column about next month’s sales in the list of monthly sales, the Lead function can be used.

Syntax: LAG (column1_name) OVER (ORDER BY column2_name) AS newcolumn_name

SELECT  
    game_name, 
    platform, 
    year_of_release, 
    global_sales, 
    LAG(global_sales, 1) OVER (ORDER BY year_of_release) AS previous_sales, 
    LEAD(global_sales, 1) OVER (ORDER BY year_of_release) AS next_sales 
FROM  
    video_games;
  • First Value: If we want to get the first value from an ordered list, we can use the first_value function. For example, this function can be used when we want to find the first year of game sales.
  • Last Value:When we want to get the last value from a sorted column, we can use the last_value function. For example, we can use this function when we want to find the last year of game sales.
SELECT  
    game_name, 
    platform, 
    year_of_release, 
    global_sales, 
    FIRST_VALUE(year_of_release) OVER (PARTITION BY platform ORDER BY year_of_release) AS first_sales, 
    LAST_VALUE(year_of_release) OVER (PARTITION BY platform ORDER BY year_of_release RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_sales 
FROM  
    video_games;

3)Case Statements

In SQL, the `CASE` statement is used to create conditional expressions to check, return values, or generate new columns through logical validation. In each `CASE` expression, the condition is specified using `WHEN`, and `END` is used to terminate the conditions.

Syntax:

CASE case_value
WHEN condition
 THEN result1
WHEN condition THEN result2

If not the result
END CASE;

SELECT  
    game_name, 
    Platform, 
    NA_Sales, 
    CASE  
        WHEN NA_Sales > 5 THEN 'High' 
        WHEN NA_Sales BETWEEN 2 AND 5 THEN 'Medium' 
        ELSE 'Low' 
    END AS Sales_Level 
FROM video_games;

Let’s write an SQL query that shows the total sales of the platforms by year.

SELECT  
    platform, 
    SUM(CASE WHEN year_of_release = 2005 THEN NA_Sales ELSE 0 END) AS sales_2005, 
    SUM(CASE WHEN year_of_release = 2006 THEN NA_Sales ELSE 0 END) AS sales_2006, 
    SUM(CASE WHEN year_of_release = 2007 THEN NA_Sales ELSE 0 END) AS sales_2007 
FROM  
    video_games 
GROUP BY  
    platform;

4)Common Table Expressions (CTEs)

CTEs are virtual tables that represent a temporarily named result set. CTEs are defined using the WITH keyword and allow you to create a named, reusable subquery within your SQL statement. There are two types of CTE Structures.

  1. Simple (Non-Recursive) CTE:These are non-cyclical CTEs. It is used for one-time purposes of preparing and editing data or making a subquery more readable.

Syntax:WITH cte_name AS (SELECT columns FROM table_name WHERE conditions
)
SELECT columns FROM cte_name WHERE conditions;

2.Recursive (Özyinelemeli) CTE: It uses a repetitive structure, starting from a starting point query. It is mostly used in hierarchical data structures or for queries that continue with a sequential structure.

Syntax(Recursive CTE): WITH RECURSIVE cte_name AS (
Anchor Query SELECT columns FROM table_name WHERE conditions

UNION ALL

Recursive Query SELECT columns FROM tablo_name INNER JOIN cte_name ON conditions
)

SELECT * FROM cte_name;

  • Anchor Query: Defines the starting point of the recursive query. It runs once and adds the result to the CTE.
  • Recursive Query: Runs the recursive query using the results obtained from the previous step.
  • UNION ALL: Combines anchor and recursive queries.
Using recursion to find the ancestors of a person. | Image: Denis Lukichev

Let’s write a Recursive CTE query to find all the ancestors of a person named Frank.

CREATE TABLE parent_of ( 
    parent VARCHAR(50), 
    child VARCHAR(50) 
);

INSERT INTO parent_of (parent, child) VALUES
(‘Alice’, ‘Carol’),
(‘Bob’, ‘Carol’),
(‘Carol’, ‘George’),
(‘Dave’, ‘Mery’),
(‘Eve’, ‘Mary’),
(‘Mary’, ‘Frank’)

WITH RECURSIVE Ancestor AS (SELECT parent AS p FROM parent_of WHERE child='Frank' 
      UNION ALL 
      SELECT parent FROM Ancestor, parent_of WHERE Ancestor.p = ParentOf.child)

SELECT * FROM Ancestor

Illustration of the results from the recursion to find the ancestors of a person. | Image: Denis Lukichev

5)EXPLAIN: Viewing the Query Plan

While Explain developers run SQL queries, we can provide us with information before or during the execution process of a query, and we can analyze the SQL query we will execute based on the results we obtain from these outputs. Depending on the complexity of the query, it provides information on topics such as the join strategy, the method of extracting data from the tables, and the estimated rows used to execute the query.

Syntax: EXPLAIN sorgu ifadesi

EXPLAIN Select * from video_games where game_id between 1 and 5 order by game_name limit 3;

The EXPLAIN ANALYZE command allows us to obtain information about the time spent executing the query, sorting and joining that cannot be done in memory, and more.

6)Index

When we work with a large database and need to find a piece of data, the machine will search every row until it finds it to get the data we want from the database. If the data you are looking for is towards the end, this query will take a long time to run. An index structure is used to eliminate this problem. Thus, Index data query speed is increased. Index tipleri

Syntax: CREATE INDEX index_name ON table_name (column_name1, column_name2);

CREATE INDEX idx_name 
ON employees (first_name, last_name);
SELECT *  
FROM employees  
WHERE first_name= 'ESRA';

7)Triggers

When we run a query in SQL, it is the code structure that enables another query to run. This code structure is used to perform a specific operation as a result of a change made to the database.

Syntax:

CREATE TRIGGER trigger_name {BEFORE | AFTER}

{INSERT | UPDATE | DELETE}

ON table_name [FOR EACH ROW]

BEGIN

Trigger action commands

END;

  • trigger_name: The name of the trigger to be created.
  • BEFORE | AFTER: Used to specify whether the trigger will run before or after the data change operation.
  • INSERT | UPDATE | DELETE: It is specified for which data change operation the trigger will run.
  • ON table_name: Specifies which table the trigger will run on.
  • FOR EACH ROW: Used to ensure that the trigger runs for each row.
  • BEGIN … END: SQL commands to be executed within the trigger are included in this range.
CREATE OR REPLACE FUNCTION set_created_at() 
RETURNS TRIGGER AS $$ 
BEGIN

NEW.created_at := NOW();
RETURN NEW;
END;

CREATE TRIGGER before_insert_videogame
BEFORE INSERT ON video_games
FOR EACH ROW
EXECUTE FUNCTION set_created_at();

INSERT INTO video_games (game_name, platform, year_of_release, genre, publisher, na_sales, eu_sales, jp_sales, other_sales, global_sales)  
VALUES ('Galactic Conquest', 'PS5', 2024, 'Sci-Fi RPG', 'FutureTech Games', 1.20, 0.95, 0.30, 0.10, 2.55);

8)Temporary Tables

Temporary tables are tables that exist only for a certain transaction or session duration and are automatically deleted when the session ends. Temporary tables are often used to store temporary data, perform intermediate calculation operations, or hold data temporarily between database operations.

Syntax: CREATE TEMPORARY TABLE table_name(column1 type,column2 type, … );

CREATE TEMPORARY TABLE temp_employees ( 
    employee_id INT, 
    first_name VARCHAR(50), 
    last_name VARCHAR(50) 
);

Since we created a temporary table named temp_employees with Create Tempory, it does not appear in the tables area.

9)Stored Procedures

Stored Procedures is a SQL query group that allows you to use SQL queries without rewriting them.It does not return a result. It performs operations such as updating, deleting or adding to the table.

Syntax: CREATE PROCEDURE procedure_name [ (parameters) ]
AS
BEGIN
SQL commands
END;

CREATE OR REPLACE PROCEDURE insert_game_data( 
    game_name TEXT, 
    platform_name TEXT, 
    release_year INT, 
    game_genre TEXT, 
    game_publisher TEXT, 
    na_sales NUMERIC, 
    eu_sales NUMERIC, 
    jp_sales NUMERIC, 
    other_sales NUMERIC, 
    global_sales NUMERIC 
) 
LANGUAGE SQL 
BEGIN ATOMIC 
    INSERT INTO video_games ( 
        game_name, 
        platform, 
        year_of_release, 
        genre, 
        publisher, 
        na_sales, 
        eu_sales, 
        jp_sales, 
        other_sales, 
        global_sales 
    ) 
    VALUES ( 
        game_name, 
        platform_name, 
        release_year, 
        game_genre, 
        game_publisher, 
        na_sales, 
        eu_sales, 
        jp_sales, 
        other_sales, 
        global_sales 
    ); 
END;

Calling Procedure

CALL insert_game_data( 
    'The Legend of Zelda', 
    'Nintendo Switch', 
    2023, 
    'Adventure', 
    'Nintendo', 
    5.0, 
    3.2, 
    2.1, 
    0.5, 
    10.8 
);

10)VIEW

View creation in PostgreSQL simplifies complex queries by collecting the desired SQL tables in a virtual table like a real table. We can create it more simply by using only VIEW structures instead of JOIN structures.

Syntax: CREATE VIEW view_name AS query;

Let’s assume that we have a table named employees, departments and sales in this database and we want to create a view that combines employee details and sales information. We can create a code like the one below.

CREATE VIEW sales_team_info AS 
SELECT 
  employees.employee_id, 
  employees.first_name, 
  employees.last_name, 
  employees.email, 
  employees.phone_number, 
  departments.department_name, 
  sales.total_sales, 
  sales.region 
FROM 
  employees  
  INNER JOIN departments  ON employees.department_id = departments.department_id 
  INNER JOIN sales ON employees.employee_id = sales.employee_id

WHERE departments.department_name=’Sales’;

11)User Defined Functions(UDFs)

User Defined Functions (UDFs) allow you to extend the built-in functionality of SQL by creating custom functions.

CREATE OR REPLACE FUNCTION calculate_age(date_of_birth DATE) 
RETURNS INTEGER AS $$ 
BEGIN 
  RETURN DATE_PART('year', AGE(date_of_birth)); 
END; 
$$ LANGUAGE plpgsql;
CREATE TABLE users ( 
  user_id SERIAL PRIMARY KEY, 
  first_name VARCHAR(50), 
  last_name VARCHAR(50), 
  date_of_birth DATE 
);

Let’s add data.

INSERT INTO users (first_name, last_name, date_of_birth) 
VALUES  
  ('James', 'Johnson', '1990-05-15'), 
  ('Bob', 'Smith', '1985-10-20'), 
  ('Charlie', 'Brown', '2000-07-25');

Let’s view the table. We can see that the values ​​in the Age column are automatically created by the function.

12)SUBQUERY

A subquery is a query nested within another query. A subquery is also known as an inner query or nested query. IN and EXISTS are two important functions that are commonly used in SQL subqueries.

IN returns the values in the queried data, while the EXISTS structure checks whether the desired data exists. If the subquery returns at least one row, it returns TRUE, otherwise it returns FALSE.

Syntax: SELECT column1, column2,.. FROM table_name
WHERE column_name IN (subquery);

SELECT column1, column2, …FROM table_name
WHERE EXISTS (subquery);

Example Code:

SELECT *  
FROM video_games  
WHERE game_id IN (SELECT game_id FROM video_games WHERE publisher = 'Nintendo');
SELECT * 
FROM video_games g1 
WHERE EXISTS ( 
    SELECT 1 
    FROM video_games g2 
    WHERE g2.publisher = 'Nintendo' 
    AND g2.game_id = g1.game_id 
) 
AND g1.year_of_release < 2020;

13)Working with JSON Data

JSON (JavaScript Object Notation) is a flexible data format that especially supports working with semi-structured data. Since data in JSON format is in a key-value pair structure, it offers an ideal solution for data transfer between web-based applications. You can click here to get detailed information about JSON operators and functions. Let’s do some work on JSON functions.

Creating JSON data

CREATE TABLE orders ( 
    id SERIAL PRIMARY KEY, 
    customer_name TEXT, 
    order_details JSON 
);

INSERT INTO orders (customer_name, order_details)
VALUES
(‘Esra’, ‘{“item”: “Laptop”, “quantity”: 1, “price”: 45000}’),
(‘Esma’, ‘{“item”: “Phone”, “quantity”: 2, “price”: 25000}’);

JSON Operators

Assuming we want to retrieve each customer’s name from the orders table along with the item and price details stored in the JSON-formatted order_details column.

Let’s query the Orders table to retrieve records where the item value in the JSON-formatted order_details column is ‘Laptop’.

SELECT *  
FROM orders 
WHERE order_details->>'item' = 'Laptop';

You can click here to learn about JSON functions. Let’s work with a json function example.

row_to_json: This function takes a database row as input and returns a single JSON object containing all columns in the row.

SELECT row_to_json(games) FROM games WHERE gameid between 1 and 5;

14)MVCC (Multi-Version Concurrency Control)

PostgreSQL uses Multiversion Concurrency Control to control access to data in a database. This control system allows multiple transactions to occur in the database at the same time and ensures consistency in the data. MVCC was developed to enable simultaneous read and write operations to be performed consistently and without interfering with each other.

-- Begin the transaction 
BEGIN;

— Insert a new game record
INSERT INTO games (name, genre, release_of_date, price)
VALUES (‘The Last of Us Part II’, ‘Action’, ‘2020-06-19’, 59.99);

— Update the price of an existing game
UPDATE games
SET price = 49.99
WHERE title = ‘The Last of Us Part II’;

— Intentional mistake to simulate failure
DELETE FROM games WHERE id = 99999;

— If all operations succeed, commit the transaction
COMMIT;

— If something goes wrong, roll back the transaction
ROLLBACK;

15)PostgreSQL Performans Tuning

There are functional features in PostgreSQL to make various settings and configurations to improve database performance. With PostgreSQL, performance tuning operations are applied for purposes such as optimizing resource management, improving query response times, and reducing input/output load.

https://dbsguru.com/postgresql-architecture-physical-storage-structure

Resource Management: There are many parameters related to resource management. You can click here to get detailed information about these. Let’s look at a few resource management functions.

  • shared_buffers:Sets the amount of memory the database server uses for shared memory buffers.
  • work_mem: Used to specify the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files.
  • vacuum_cost_delay:Specifies the time in milliseconds after which the process will go to sleep when the cost limit is exceeded.

You can get detailed information about the parameters used in the utility process from the link below.

16)SQL Functions Guide

You can check out the guide I wrote about SQL functions before here.

RESOURCE

  1. DB-Engines rankinghttps://db-engines.com/en/ranking
  2. Khan, W., Zhang, C., Luo, B., Kumar, T., & Ahmed, E. (2021, November). Robust Partitioning Scheme for Accelerating SQL Database. In 2021 IEEE International Conference on Emergency Science and Information Technology (ICESIT) (pp. 369–376). IEEE.
  3. Migeruj. (2017). Video Games Predictive Model [Data set]. Kaggle. https://www.kaggle.com/datasets/migeruj/videogames-predictive-model
  4. Mannino, M. (2022). Database Design, Query Formulation, and Administration: Using Oracle and PostgreSQL. Sage Publications.
  5. PostgreSQL Tutorial. https://www.postgresql.org/docs/current/
  6. PostgreSQL Performance Tuning. https://sematext.com/blog/postgresql-performance-tuning/
  7. PostgreSQL Performance Tuning. https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server