In today's fast-paced tech landscape, few of us are still writing SQL queries entirely from scratch. With the rise of Large Language Models (LLMs) and sophisticated AI tools, the process often involves crafting a prompt, making a few strategic adjustments, and letting AI handle the heavy lifting of syntax generation. The query gets written, the data flows, and the job is done—efficiently. Yet, there’s a noticeable disconnect when it comes to technical interviews. Despite this paradigm shift, many SQL assessments continue to present candidates with a pristine schema, a complex problem statement, and a stark, empty editor. The unspoken expectation? That you'll flawlessly conjure a perfectly optimized query from thin air, unaided, as if this were still the standard operational mode for real-world data tasks.
This traditional interview format, while seemingly resistant to change, remains the gatekeeper for many promising careers. Simply lamenting its anachronistic nature won't alter the outcome. The pertinent question, then, isn't whether this approach is fair, but how to master it without resorting to rote memorization of countless isolated queries. The good news is that these challenges aren't arbitrary. Instead, they crystallize into a manageable collection of recurring patterns. Once you learn to identify these underlying structures—like intelligent grouping, selecting a single record per entity, comparing sequential data points, verifying existence (or non-existence), linking related events, identifying data streaks, or detecting overlapping timeframes—the challenge transforms. It moves beyond recalling specific keywords to understanding the fundamental logical shape of the problem. In a world where AI can effortlessly generate syntax, knowing the "direction" and underlying pattern becomes an interviewer's true measure of your SQL prowess.
What "SQL Patterns" Truly Mean
To be clear, when we discuss SQL patterns, we're not advocating for rote memorization of functions or an exhaustive mental dictionary of keywords. A "pattern" isn't merely ROW_NUMBER, CASE, or EXISTS. Instead, it represents a repeatable cognitive framework for dissecting a problem and intuitively grasping its underlying structure, long before you even consider the specific syntax needed to solve it.
Careful analysis of a problem statement often reveals key "signals." Terms such as 'latest,' 'per user,' 'previous,' 'missing,' or 'at least N' are rarely arbitrary. They typically act as signposts, guiding you toward a particular solution approach, irrespective of the specific database schema or business context.
For instance, 'latest per user' seldom resolves with simple aggregation alone; it almost always implies a ranking operation within distinct groups, followed by the selection of a single row. 'Previous value' has nothing to do with grouping; it's inherently about ordered, sequential comparisons. Similarly, 'users without orders' isn't about performing a join for data retrieval; it's about verifying the presence or absence of a related record.
This is the essence of pattern recognition. The vast majority of SQL interview questions are built upon a relatively compact set of these structural archetypes. While real-world scenarios and more complex interview problems might combine two or more patterns—say, aggregation coupled with an existence check, or ranking interwoven with conditional logic—once you master these foundational shapes, tackling their combinations becomes significantly less daunting.
Before we dive into the patterns, a quick note on style: throughout this guide, we'll employ Common Table Expressions (CTEs), using the WITH ... AS syntax. If you're unfamiliar, a CTE is essentially a named, temporary result set that you define at the beginning of your query and can then reference like a regular table. It doesn't alter the query's core logic but dramatically enhances readability by breaking down complex operations into digestible, named steps, preventing deeply nested subqueries. Now, let's explore these fundamental patterns.
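One last bit of setup: here is what that CTE shape looks like in practice. This is a minimal sketch on hypothetical data (an invented orders table), run through Python's built-in sqlite3 so you can execute it yourself.

```python
import sqlite3

# Minimal CTE sketch on hypothetical data: the WITH clause names an
# intermediate result set, which the outer query then reads like a table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (user_id INTEGER, amount INTEGER);
    INSERT INTO orders VALUES (1, 10), (1, 20), (2, 5);
""")

rows = conn.execute("""
    WITH user_totals AS (          -- named temporary result set
        SELECT user_id, SUM(amount) AS total
        FROM orders
        GROUP BY user_id
    )
    SELECT user_id, total
    FROM user_totals               -- referenced like a regular table
    WHERE total > 10
""").fetchall()
print(rows)  # [(1, 30)]
```

The same query could be written with a nested subquery in the FROM clause; the CTE form simply gives the intermediate step a name.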
1. Aggregation
Aggregation is arguably the most ubiquitous SQL pattern you'll encounter, often deceptively simple in appearance. At its heart, this pattern involves consolidating multiple rows into a single summary result per defined entity, allowing you to make decisions or draw conclusions at that higher level.
The linguistic cues are usually straightforward. Phrases like "for each user," "per department," "per product," or any question asking "how many" or "how much" almost always point in this direction. If you can rephrase the problem as "for each X, calculate Y," you're most likely dealing with aggregation.
While the specific context—counting reports, summing orders, averaging transactions—varies from task to task, the core methodology remains constant: define your grouping key, compute an aggregate (using functions like COUNT, SUM, AVG, etc.), and potentially filter the aggregated results using HAVING. Remember, HAVING filters group results, distinct from WHERE which filters individual rows before grouping.
Consider a classic example from the LeetCode Top SQL 50: Managers with at Least 5 Direct Reports.
This problem asks for managers who oversee five or more direct reports. The key signal is "at least 5 direct reports"—a count with a threshold, indicating aggregation with a HAVING clause. The grouping key is managerId, as you need to count employees associated with each unique manager.
SELECT e2.name
FROM (
SELECT managerId, COUNT(*) AS c
FROM Employee
GROUP BY managerId
HAVING c >= 5
) res
INNER JOIN Employee e2
ON res.managerId = e2.id;
Here, the subquery groups employees by their manager, tallies each group, retains only those managers with five or more reports, and then the outer join retrieves the manager's name. It's a clear sequence: group, count, filter, then look up the necessary details.
Common pitfalls include attempting to return columns not part of the grouping key, using WHERE when HAVING is required, or incorrectly identifying the grouping key, which usually means misinterpreting the "per entity" requirement.
For additional practice, try Find Followers Count and Number of Unique Subjects Taught by Each Teacher—both are straightforward single-table aggregations. For a more challenging variant, Customers Who Bought All Products uses the same GROUP BY + HAVING structure but compares the aggregated count against a subquery-derived total to ensure a customer bought every product, not just a fixed number.
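As a sketch of the shape that harder variant takes (hypothetical toy data, run here through Python's sqlite3; the real problem's schema may differ), the HAVING clause compares a per-customer distinct count against a subquery-derived total:

```python
import sqlite3

# Hedged sketch of the "bought all products" shape: HAVING compares each
# customer's distinct-product count to the total number of products.
# Table names and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product (product_key INTEGER);
    INSERT INTO Product VALUES (5), (6);
    CREATE TABLE Customer (customer_id INTEGER, product_key INTEGER);
    INSERT INTO Customer VALUES (1, 5), (1, 6), (2, 6), (3, 5), (3, 6);
""")

rows = conn.execute("""
    SELECT customer_id
    FROM Customer
    GROUP BY customer_id
    HAVING COUNT(DISTINCT product_key) =
           (SELECT COUNT(*) FROM Product)   -- the total, not a fixed number
    ORDER BY customer_id
""").fetchall()
print(rows)  # [(1,), (3,)]
```

Customers 1 and 3 bought both products; customer 2 bought only one, so the HAVING condition filters them out.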
2. Conditional Aggregation
Conditional aggregation becomes essential when standard aggregation falls short, particularly when you need to segment your results based on specific conditions without fragmenting the query itself. Instead of running separate queries for different categories and merging them, this pattern lets you compute multiple distinct metrics in a single pass. This is also a common technique for pivoting rows into columns in database systems that lack a dedicated PIVOT operator.
Signals for this pattern are tasks that demand several counts or sums displayed side-by-side, typically broken down by status, type, or category. Examples include: "Count approved vs. rejected items," "total revenue from domestic vs. international sales," or "number of completed and cancelled orders per user." Anytime you see a problem asking for various metrics from the same table, grouped identically but filtered differently, you're likely facing conditional aggregation.
The core technique involves embedding CASE WHEN expressions directly within your aggregate functions. Instead of filtering out rows entirely, you instruct each aggregate function to only consider rows that meet its specific criteria. For instance, SUM(CASE WHEN state = 'approved' THEN 1 ELSE 0 END) will count only approved rows, while a standard COUNT(*) in the same query would count all rows. This allows for a single grouping pass, applying different conditions to yield diverse metrics.
A prime example from the LeetCode Top SQL 50 list is Monthly Transactions I.
The problem asks for each month and country: the total transaction count, the approved transaction count, the total transaction amount, and the approved transaction amount. The key signal is the need for both "all transactions" and "only approved" metrics within the same result set. This dual filtering requirement on the same grouping is a textbook case for conditional aggregation.
SELECT
DATE_FORMAT(trans_date, '%Y-%m') AS month,
country,
COUNT(*) AS trans_count,
SUM(CASE WHEN state = 'approved' THEN 1 ELSE 0 END) AS approved_count,
SUM(amount) AS trans_total_amount,
SUM(CASE WHEN state = 'approved' THEN amount ELSE 0 END) AS approved_total_amount
FROM Transactions
GROUP BY month, country;
Here, you group by month and country, then utilize unconditional aggregates for totals and conditional aggregates for the approved subset. This achieves all four metrics efficiently with a single query scan.
Watch out for common errors such as crafting separate queries or subqueries for each metric when a single CASE WHEN inside the aggregate would suffice, omitting the ELSE 0 (SUM ignores NULLs, but a group where no row matches then yields NULL instead of 0, and that NULL can silently propagate through later arithmetic), and confusing this pattern with WHERE filtering, which would inadvertently remove rows still needed for total counts.
For more practice, explore Queries Quality and Percentage and Count Salary Categories, both of which apply conditional logic within aggregates on a single table.
3. Top N Per Group (Ranking)
This pattern is frequently conflated with aggregation, but their distinction is critical. While aggregation collapses a group into a single summary value, ranking preserves the individual rows within each group, allowing you to select specific ones based on their position within that ordered group.
Look for signals like "latest," "highest," "most recent," or "top 3," always paired with a "per entity" qualifier such as "per user," "per department," or "per category." The definitive clue is that the problem explicitly asks for full rows of data, not merely a count or sum. If the request is for "the highest salary per department," they want the employee's details, not just the salary figure. That's ranking, not aggregation.
The primary tool for this is window functions: ROW_NUMBER, RANK, or DENSE_RANK. You'll partition by the grouping key and order by the criteria that define "top." The ranking is computed in a subquery or CTE, then filtered in the outer query (e.g., WHERE rank_column <= N). The choice between ROW_NUMBER, RANK, and DENSE_RANK hinges on how ties should be handled. ROW_NUMBER assigns a unique rank, even for ties. RANK assigns the same rank to ties but skips the next number (1, 1, 3). DENSE_RANK assigns the same rank to ties but does not skip the next number (1, 1, 2), which is often what's desired when a problem asks for "top N distinct values."
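The tie-handling differences are easiest to see side by side. Here is a minimal sketch on hypothetical salaries (run through Python's sqlite3; window functions require SQLite 3.25+):

```python
import sqlite3  # window functions require SQLite >= 3.25

# Side-by-side tie handling on a hypothetical salary list with a tie at 90.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (salary INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(90,), (90,), (80,)])

rows = conn.execute("""
    SELECT salary,
           ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn,    -- 1, 2, 3
           RANK()       OVER (ORDER BY salary DESC) AS rnk,   -- 1, 1, 3
           DENSE_RANK() OVER (ORDER BY salary DESC) AS drnk   -- 1, 1, 2
    FROM t
    ORDER BY rn
""").fetchall()
print(rows)  # [(90, 1, 1, 1), (90, 2, 1, 1), (80, 3, 3, 2)]
```

Note how the tied 90s get distinct ROW_NUMBERs, the same RANK with a gap after them, and the same DENSE_RANK with no gap.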
A compelling example is Department Top Three Salaries from the LeetCode Top SQL 50.
The task requires identifying employees who earn one of the top three unique salaries within their respective departments. The signal is "top three" + "in each department," and the output demands employee names and salaries—a clear indicator of ranking. Given the emphasis on "top three unique salaries," DENSE_RANK is the appropriate choice to handle salary ties correctly.
WITH RankedSalaries AS (
SELECT
e.name AS employee,
e.salary,
e.departmentId,
DENSE_RANK() OVER (
PARTITION BY e.departmentId
ORDER BY e.salary DESC
) AS salary_rank
FROM Employee e
)
SELECT
d.name AS Department,
r.employee,
r.salary
FROM RankedSalaries r
JOIN Department d
ON r.departmentId = d.id
WHERE r.salary_rank <= 3;
The CTE assigns a DENSE_RANK to each employee within their department based on salary in descending order. The outer query then filters for employees with a salary_rank of 3 or less and joins with the Department table to retrieve department names.
Be mindful of common errors: using ROW_NUMBER when the problem requires preserving ties, selecting RANK instead of DENSE_RANK when "top 3" refers to three distinct values rather than three positional ranks, and forgetting that window functions must be computed in a subquery or CTE before filtering can be applied to their results.
For additional practice, consider Product Sales Analysis III, which asks for the first year of sales per product—solvable with either a simple MIN subquery or RANK() OVER (PARTITION BY product_id ORDER BY year), offering a great comparison of approaches. For a tougher challenge, FAANG Stock Min-Max combines ranking with aggregation, requiring you to compute monthly prices and then rank twice to find the highest and lowest per ticker.
4. Sequential Analysis (LAG / LEAD)
This pattern emerges whenever a problem requires comparing a row with its immediate predecessor or successor within a specific ordered sequence. This isn't about different groups or aggregates, but about analyzing adjacent records.
Signals include terms like "previous," "next," "change," "difference," or "consecutive"—anything implying an ordered comparison. Time-series data is the most frequent context: "compared to yesterday," "change from last month," or "three consecutive days." However, it applies equally to any ordered sequence, such as consecutive IDs or ranked entries.
The solution typically involves LAG and LEAD window functions. LAG retrieves a value from a preceding row, while LEAD fetches a value from a subsequent row. You define the sequence using ORDER BY within the window function, then you can compare, subtract, or apply conditions between the current row and its neighbor. The result is usually placed in a subquery or CTE because you need to compute the shifted value first, then filter or use it in further calculations.
A prime example is Consecutive Numbers from the LeetCode Top SQL 50.
The task is to identify all numbers appearing at least three times consecutively. The signal "consecutive" indicates a need to examine neighbors in sequence, rather than simply counting global occurrences. This isn't aggregation; it requires checking if a row's value matches both its preceding and succeeding values.
WITH cte AS (
SELECT
num,
LEAD(num) OVER (ORDER BY id) AS next,
LAG(num) OVER (ORDER BY id) AS prev
FROM Logs
)
SELECT DISTINCT num AS ConsecutiveNums
FROM cte
WHERE num = next AND num = prev;
The CTE enriches each row with two new columns: the value of the next row and the value of the previous row. The outer query then filters to keep only those rows where the current, previous, and next values are all identical, signifying at least three consecutive occurrences. DISTINCT is used to handle scenarios where a number appears consecutively more than three times.
Common mistakes include forgetting that LAG and LEAD return NULL for the first and last rows, respectively (potentially breaking comparisons if not explicitly handled), ordering by an incorrect column (the ordering must reflect the actual sequence), and attempting to solve sequential problems with self-joins when a window function offers a cleaner, more readable solution.
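The NULL edge case is worth making explicit: both LAG and LEAD accept an optional default as a third argument (offset, then default). A minimal sketch on hypothetical data, via Python's sqlite3:

```python
import sqlite3  # LAG/LEAD need SQLite >= 3.25

# Sketch: LAG returns NULL for the first row (and LEAD for the last)
# unless you pass an explicit default. The three-row log is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logs (id INTEGER, num INTEGER);
    INSERT INTO logs VALUES (1, 7), (2, 7), (3, 9);
""")

rows = conn.execute("""
    SELECT id,
           LAG(num)        OVER (ORDER BY id) AS prev,      -- NULL on row 1
           LAG(num, 1, -1) OVER (ORDER BY id) AS prev_dflt  -- -1 instead
    FROM logs
    ORDER BY id
""").fetchall()
print(rows)  # [(1, None, -1), (2, 7, 7), (3, 7, 7)]
```

Whether you want the NULL (so the edge row drops out of a comparison) or a sentinel default depends on the problem; just decide deliberately.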
For more practice, try Rising Temperature, which asks you to find days where the temperature was higher than the previous day. This simpler version of the pattern is excellent for comparing LAG with a self-join. For a harder problem, Repeated Payments uses LAG partitioned by three columns to detect duplicate credit card charges within a tight 10-minute window.
5. Event Pairing
This is a particularly insightful pattern often favored in interviews, as it probes a candidate's ability to recognize that not all table rows are independent records. Sometimes, two distinct rows represent two halves of a single, overarching event, and the true insight emerges only when these halves are stitched together.
Signals for this pattern typically involve tables where each entity has multiple rows corresponding to status or event types: "start" and "stop," "open" and "close," "login" and "logout." The problem then asks for a duration, a gap, or a total time span. Whenever you encounter a table logging state changes and the question concerns the time difference between these states, you're looking at event pairing. While it employs the same underlying tool as sequential analysis (Chapter 4), it addresses a fundamentally different problem: combining neighbors into a single, cohesive record.
The primary technique is the LEAD (or occasionally LAG) window function. For each "start" row, you use LEAD to fetch the timestamp of the corresponding "stop" row. By partitioning by the entity and ordering by time, you effectively create these pairs.
A great illustration is Average Time of Process per Machine.
The task is to calculate the average processing time per machine, where each process is defined by distinct "start" and "end" rows. The signal is a table with an activity_type column, indicating either "start" or "end," and the request for the time duration between them. Two rows, one complete event.
WITH paired AS (
SELECT
machine_id,
activity_type,
timestamp,
LEAD(timestamp) OVER (
PARTITION BY machine_id, process_id
ORDER BY timestamp
) AS end_time
FROM Activity
)
SELECT
machine_id,
ROUND(AVG(end_time - timestamp), 3) AS processing_time
FROM paired
WHERE activity_type = 'start'
GROUP BY machine_id;
The CTE uses LEAD to attach the subsequent timestamp to each row, scoped within the same machine and process. The outer query then filters for only "start" activity types, ensuring that end_time accurately represents the matching stop event. Finally, it averages the calculated duration per machine.
While this problem could also be solved with conditional aggregation (e.g., MAX(CASE WHEN 'start' ...) and MAX(CASE WHEN 'end' ...) grouped by process), which is effective when start/stop events share an explicit key, LEAD offers a more generalized solution, especially when events merely alternate in sequence without a direct linking key.
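A sketch of that conditional-aggregation alternative on hypothetical data (Python's sqlite3; the timestamps are invented): each (machine, process) pair collapses to one row, and the durations are then averaged per machine.

```python
import sqlite3

# Hedged sketch: when start/end rows share an explicit key
# (machine_id, process_id), MAX(CASE ...) collapses each pair to one row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Activity (
        machine_id INTEGER, process_id INTEGER,
        activity_type TEXT, timestamp REAL
    );
    INSERT INTO Activity VALUES
        (0, 0, 'start', 0.5), (0, 0, 'end', 1.0),
        (0, 1, 'start', 3.0), (0, 1, 'end', 4.0);
""")

rows = conn.execute("""
    WITH durations AS (
        SELECT machine_id,
               MAX(CASE WHEN activity_type = 'end'   THEN timestamp END)
             - MAX(CASE WHEN activity_type = 'start' THEN timestamp END)
               AS dur
        FROM Activity
        GROUP BY machine_id, process_id
    )
    SELECT machine_id, ROUND(AVG(dur), 3) AS processing_time
    FROM durations
    GROUP BY machine_id
""").fetchall()
print(rows)  # [(0, 0.75)]
```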
Key mistakes include assuming perfect start/stop alternation (real-world data often has gaps or duplicates), neglecting to filter for only "start" rows after using LEAD (which would incorrectly pair a stop with the next start), and using an incorrect partition, leading to events from different entities being mistakenly paired.
6. Gaps and Islands
This pattern addresses the challenge of identifying "streaks" or consecutive runs of rows that share a particular property, where the grouping isn't explicitly defined within the table. You must derive these unbroken chains from the sequence of the data itself.
Signals for this pattern include words like "consecutive," "streak," "in a row," "continuous," or "uninterrupted." You're given ordered data and asked to pinpoint groups of rows forming an unbroken sequence: consecutive days of activity, consecutive years of filing, or consecutive months of subscription. The difference from sequential analysis (Chapter 4) is that you're not just comparing neighbors; you're identifying and measuring or filtering entire groups of consecutive rows.
The classic approach, initially appearing as a trick but quickly becoming intuitive, relies on a clever observation: if you have a sequence of consecutive values (like dates or years) and you subtract a ROW_NUMBER (partitioned and ordered appropriately) from each value, all rows belonging to the same consecutive run will yield an identical result. This constant value then serves as your unique group identifier.
Consider a user who made purchases on days 5, 6, 7, 10, and 11. The ROW_NUMBER within that user, ordered by day, would be 1, 2, 3, 4, 5. Subtracting ROW_NUMBER from the day: 5-1=4, 6-2=4, 7-3=4, 10-4=6, 11-5=6. The first streak (days 5, 6, 7) yields a group value of 4, while the second streak (days 10, 11) yields a group value of 6. The specific numbers are less important than the fact that consecutive rows generate the same group identifier.
A strong example is User Shopping Sprees.
The problem asks to find users who made purchases on 3 or more consecutive days. The signal "3 or more consecutive days" perfectly matches island detection: locate streaks of successive dates and then assess their length.
WITH daily_purchases AS (
SELECT DISTINCT
user_id,
transaction_date::date AS purchase_date
FROM transactions
),
islands AS (
SELECT
user_id,
purchase_date,
purchase_date - ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY purchase_date
)::int AS grp
FROM daily_purchases
)
SELECT DISTINCT user_id
FROM islands
GROUP BY user_id, grp
HAVING COUNT(*) >= 3
ORDER BY user_id;
The first CTE ensures unique user-day entries, critical since multiple purchases on one day should still count as one day of activity. The second CTE applies the purchase_date - ROW_NUMBER() trick to assign a consistent group identifier to consecutive dates within each user's activity. The final query then groups by user and this derived group ID, filters for streaks of three or more days, and returns the unique user IDs.
Common mistakes: failing to deduplicate before applying ROW_NUMBER (which would break the consecutive subtraction logic if multiple events occur on the same day), and not properly partitioning ROW_NUMBER by the correct entity (which would incorrectly merge streaks from different users).
For more practice, try Consecutive Filing Years on DataLemur, which uses the same pattern but with years and an added product filter.
7. Deduplication (Picking One Row)
This pattern often overlaps with ranking (Chapter 3), but its primary objective is different. While ranking selects the top N entries from a group, deduplication aims to cleanse data that shouldn't contain multiple rows for the same logical entity in the first place.
Signals include terms like "duplicate," "remove," "keep only one," "unique per user," or "latest record." Sometimes the problem explicitly calls for deletion; other times, it asks for a result set as if duplicates didn't exist. In either case, the core task is identical: define what constitutes a "duplicate" group, establish the criteria for which row survives, and eliminate the rest.
The go-to approach is ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...). You partition by the columns that define your duplicate group (e.g., email for duplicate emails), order by the criteria that determine the "winner" (e.g., id ascending for the smallest ID, or timestamp DESC for the latest record), and then either select rows where rn = 1 (for querying) or delete rows where rn > 1 (for data cleansing). The underlying mechanism is versatile for both retrieval and modification.
An excellent example is Delete Duplicate Emails from the LeetCode Top SQL 50.
The problem instructs you to delete all duplicate emails, preserving only the row with the smallest ID for each unique email address. The signal is explicit: "delete duplicate." The grouping key is email, and the tie-breaker for which row to keep is the smallest id.
DELETE FROM Person
WHERE id NOT IN (
SELECT id FROM (
SELECT
id,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY id
) AS rn
FROM Person
) ranked
WHERE rn = 1
);
The innermost subquery partitions by email, orders by id, and assigns a row number. The middle subquery then selects only those ids where rn = 1, representing the smallest id for each unique email. The outer DELETE statement then removes all other rows.
While a self-join could also solve this (e.g., DELETE p FROM Person p JOIN Person q ON p.email = q.email AND p.id > q.id), the ROW_NUMBER approach is more generalized and robust. If the requirement shifts from "smallest ID" to "most recent timestamp," you only need to adjust the ORDER BY clause. If it changes from a DELETE to a SELECT, you simply remove the outer DELETE layer.
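The SELECT form of the same mechanism, sketched on hypothetical data via Python's sqlite3, differs only in keeping rn = 1 instead of deleting rn > 1:

```python
import sqlite3  # ROW_NUMBER needs SQLite >= 3.25

# Sketch: the same ROW_NUMBER dedup as a SELECT. The Person rows
# here are hypothetical; the smallest id per email survives.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (id INTEGER, email TEXT);
    INSERT INTO Person VALUES
        (1, 'a@x.com'), (2, 'b@x.com'), (3, 'a@x.com');
""")

rows = conn.execute("""
    SELECT id, email
    FROM (
        SELECT id, email,
               ROW_NUMBER() OVER (
                   PARTITION BY email
                   ORDER BY id          -- smallest id wins
               ) AS rn
        FROM Person
    ) ranked
    WHERE rn = 1
    ORDER BY id
""").fetchall()
print(rows)  # [(1, 'a@x.com'), (2, 'b@x.com')]
```

Swapping ORDER BY id for ORDER BY created_at DESC (if such a column existed) would keep the latest record instead, with no other changes.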
Common pitfalls include blindly using DISTINCT when the problem demands a specific row be kept (DISTINCT deduplicates the output but offers no control over which specific row's data is retained), failing to define the ordering criteria, which leads to arbitrary row retention, and overlooking tie-breaking rules when multiple rows are identical on the ordering column.
8. Existence / Anti-Existence
This pattern focuses on determining whether a corresponding record exists, or does not exist, in a related table. The key here is that you're not joining to retrieve data from the second table; you're joining to answer a binary "yes/no" question about its presence.
Signals typically involve phrases like "users without orders," "never purchased," "did not make," "at least one," or "has no matching." Whenever the problem requires filtering one table based on the mere presence or absence of rows in another table, you're dealing with existence logic.
There are two primary approaches. The first uses EXISTS / NOT EXISTS with a correlated subquery, which is often more concise and performant for complex conditions. The second, and often more intuitive for those comfortable with joins, is a LEFT JOIN followed by a WHERE join_column IS NULL clause. Both achieve the same outcome: filtering the primary table based on the existence or non-existence of a match in the secondary table.
A good example is Customer Who Visited but Did Not Make Any Transactions.
The task is to identify customers who visited but failed to make any transactions during that visit, and then count how many times this occurred. The signal, "visited but did not make any transactions," is classic anti-existence: you need to find visits where no corresponding transaction exists.
SELECT customer_id, COUNT(*) AS count_no_trans
FROM Visits v
LEFT JOIN Transactions t
ON t.visit_id = v.visit_id
WHERE t.transaction_id IS NULL
GROUP BY customer_id;
The LEFT JOIN retains all visits, regardless of whether a transaction occurred. The WHERE transaction_id IS NULL clause then isolates only those visits that lacked a matching transaction. Finally, the results are grouped by customer ID and counted. The same result could also be achieved using NOT EXISTS:
SELECT customer_id, COUNT(*) AS count_no_trans
FROM Visits v
WHERE NOT EXISTS (
SELECT 1 FROM Transactions t
WHERE t.visit_id = v.visit_id
)
GROUP BY customer_id;
Both approaches are valid; choose the one that feels most natural and readable to you.
Be wary of common mistakes: using INNER JOIN when anti-existence is required (which would eliminate the very rows you're trying to find), employing NOT IN with a subquery that might return NULLs (a subtle bug where NOT IN with a NULL value returns an empty set), and misinterpreting the problem as needing a regular join when no columns from the secondary table are actually required in the output.
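The NOT IN pitfall is worth seeing once, because it fails silently. A sketch on hypothetical data (Python's sqlite3), with one NULL planted in the subquery's result:

```python
import sqlite3

# Demonstration of the NOT IN + NULL pitfall: once the subquery result
# contains a NULL, `x NOT IN (...)` can never evaluate to TRUE, so the
# query returns no rows at all. NOT EXISTS is immune. Data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Visits (visit_id INTEGER);
    INSERT INTO Visits VALUES (1), (2);
    CREATE TABLE Transactions (visit_id INTEGER);
    INSERT INTO Transactions VALUES (1), (NULL);
""")

not_in = conn.execute("""
    SELECT visit_id FROM Visits
    WHERE visit_id NOT IN (SELECT visit_id FROM Transactions)
""").fetchall()

not_exists = conn.execute("""
    SELECT visit_id FROM Visits v
    WHERE NOT EXISTS (
        SELECT 1 FROM Transactions t WHERE t.visit_id = v.visit_id
    )
""").fetchall()

print(not_in)      # [] -- the NULL poisons every comparison
print(not_exists)  # [(2,)]
```

Visit 2 genuinely has no transaction, yet the NOT IN version returns nothing, because `2 NOT IN (1, NULL)` evaluates to UNKNOWN rather than TRUE.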
9. Self-Join / Pairwise Comparison
This pattern is applicable when all the necessary data resides within a single table, and the task is to compare rows from that table against each other. There's no external table to join to; the relationships are inherent within the dataset itself.
The signals are problems where entities in a table reference other entities within the same table: employees and their managers, friends and friend requests, or records that require comparison with other records from the same source. Any scenario where the problem asks to "find pairs," "compare with other rows in the same table," or where a column like manager_id or reports_to links back to the primary key of the same table.
The technique involves joining the table to itself using distinct aliases. You treat one alias as the "main" row and the other as the "related" row, with the join condition defining their relationship. Precision in this condition is paramount, as a loosely defined self-join can easily lead to duplicate pairs or a Cartesian explosion.
While seemingly similar to sequential analysis (Chapter 4) in its comparison of rows within the same table, the distinction lies in the type of relationship. LAG/LEAD works for ordered, sequential neighbors. A self-join is used when rows are connected by a key or condition that isn't dependent on their physical order. For example, you can't find "each employee's manager" with LAG because there's no inherent ordering where the next row happens to be their manager. The connection is hierarchical, not positional, making a self-join the necessary approach.
A compelling example is The Number of Employees Which Report to Each Employee.
The task is to determine, for each manager, the count of their direct reports and the average age of those employees. The signal is the reports_to column pointing back to employee_id within the same Employees table. One table, two conceptual roles: the employee and their manager.
SELECT
m.employee_id,
m.name,
COUNT(e.employee_id) AS reports_count,
ROUND(AVG(e.age)) AS average_age
FROM Employees e
JOIN Employees m
ON e.reports_to = m.employee_id
GROUP BY m.employee_id, m.name
ORDER BY m.employee_id;
One alias of the table (e) represents individual employees, while the other (m) represents their managers. The JOIN connects each employee to their manager via the reports_to foreign key, establishing a hierarchical relationship. From there, it's a matter of aggregation: counting reports, averaging their ages, and grouping by the manager's details.
Common mistakes include forgetting to use distinct aliases for the table (which renders the query ambiguous), applying the wrong join direction (resulting in employees without reports rather than managers with reports), and failing to account for rows matching themselves when the join condition permits, which can inflate counts or create spurious pairs.
For more practice, try Rising Temperature, which self-joins the Weather table to compare each day's temperature with the previous day, demonstrating a self-join for sequential comparison rather than hierarchical relationships.
10. Cartesian Expansion (CROSS JOIN)
This pattern is invoked when the required data doesn't fully exist in your source tables. Your table might log actual events, but the problem demands a report on every conceivable combination, including those where nothing happened. Essentially, you need to construct the complete matrix first, then populate it with existing actuals.
Signals are problems expecting output rows for combinations that currently lack data. For instance, "Show attendance for every student in every subject" means you need a row for students who never took a particular subject. "Revenue per product per month" implies rows for months with zero sales. Whenever the desired output necessitates including missing combinations with zeros or NULLs, you're looking at a CROSS JOIN.
The approach involves generating all possible combinations initially with a CROSS JOIN, then using a LEFT JOIN to merge in the actual data where it exists. The CROSS JOIN lays the foundational structure (the "skeleton"), the LEFT JOIN adds the relevant data (the "flesh"), and any unmatched combinations remain as zeros or NULLs, as required.
A good example is Students and Examinations.
The task is to report how many times each student attended each exam. The crucial detail is that every student should be paired with every subject, even if they never took that subject. The signal involves three tables (Students, Subjects, Examinations) where the output must include every student Ă— subject permutation, not just those with existing exam records.
SELECT
s.student_id,
s.student_name,
sub.subject_name,
COUNT(e.subject_name) AS attended_exams
FROM Students s
CROSS JOIN Subjects sub
LEFT JOIN Examinations e
ON s.student_id = e.student_id
AND sub.subject_name = e.subject_name
GROUP BY s.student_id, s.student_name, sub.subject_name
ORDER BY s.student_id, sub.subject_name;
The CROSS JOIN between Students and Subjects generates every possible student-subject combination. The subsequent LEFT JOIN to Examinations attaches actual attendance records where they exist. Using COUNT(e.subject_name) ensures that only non-NULL matches are counted, correctly yielding zero for students who never took a specific subject. Without the CROSS JOIN, these zero-count rows would simply be absent from the output.
Common mistakes include not recognizing the need for a CROSS JOIN (often leading to attempts with only a LEFT JOIN, which misses the zero-data combinations), underestimating the multiplicative blow-up in row counts when cross-joining large tables, and using COUNT(*) instead of COUNT(column_from_left_joined_table), which counts unmatched NULL rows as 1 instead of 0.
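That last pitfall is easy to see on a tiny inline dataset. This Postgres-flavored sketch (the table and column names are made up for illustration) builds one matched and one unmatched row, mimicking what a LEFT JOIN produces, then counts both ways:

```sql
-- One matched combination and one unmatched combination (val IS NULL),
-- standing in for a missing exam record after a LEFT JOIN.
SELECT combo,
       COUNT(*)   AS count_star,  -- counts every row, NULL or not
       COUNT(val) AS count_col    -- skips NULLs
FROM (VALUES ('matched', 1), ('unmatched', NULL)) AS t(combo, val)
GROUP BY combo;
-- 'matched'   -> count_star = 1, count_col = 1
-- 'unmatched' -> count_star = 1, count_col = 0
```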
For more practice, try 3-Topping Pizzas on DataLemur, which cross-joins a table to itself three times to generate all possible 3-topping combinations and uses < comparisons to eliminate duplicates and enforce alphabetical order in one step.
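The skeleton-building idea carries over directly. Here is a minimal sketch of that triple self-join, assuming a pizza_toppings table with topping_name and ingredient_cost columns as in the problem statement:

```sql
-- Each alias independently picks one topping; the < comparisons keep only
-- alphabetically ordered triples, which both removes duplicate combinations
-- and enforces alphabetical output in one step.
SELECT
  CONCAT(t1.topping_name, ',', t2.topping_name, ',', t3.topping_name) AS pizza,
  t1.ingredient_cost + t2.ingredient_cost + t3.ingredient_cost AS total_cost
FROM pizza_toppings t1
CROSS JOIN pizza_toppings t2
CROSS JOIN pizza_toppings t3
WHERE t1.topping_name < t2.topping_name
  AND t2.topping_name < t3.topping_name
ORDER BY total_cost DESC, pizza;
```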
11. Interval Overlap
This pattern arises when you're working with data representing ranges, most commonly time ranges, and the goal is to identify where these ranges intersect or conflict. It's a surprisingly frequent scenario in interviews, especially for companies involved in scheduling, bookings, or resource allocation.
The signals are problems that mention "overlapping," "conflicting," "at the same time," "double booked," or "concurrent." Whenever entities have a defined start and end point, and the question asks if any of them collide, you're dealing with interval overlap.
The solution hinges on a specific condition worth committing to memory: two intervals [a_start, a_end] and [b_start, b_end] overlap if and only if a_start < b_end AND b_start < a_end. An easier way to conceptualize this is by considering the negation: two intervals do NOT overlap if one ends before the other begins (a_end <= b_start or b_end <= a_start). The overlap condition is simply the logical inverse of this non-overlap condition.
Let's illustrate with an example. Imagine a meeting_rooms table where each row signifies a booking:
meeting_rooms
+------------+------------+---------------------+---------------------+
| booking_id | room_id | start_time | end_time |
+------------+------------+---------------------+---------------------+
| 1 | A | 2024-03-01 09:00 | 2024-03-01 10:00 |
| 2 | A | 2024-03-01 09:30 | 2024-03-01 10:30 |
| 3 | A | 2024-03-01 11:00 | 2024-03-01 12:00 |
| 4 | B | 2024-03-01 09:00 | 2024-03-01 10:30 |
| 5 | B | 2024-03-01 10:00 | 2024-03-01 11:00 |
+------------+------------+---------------------+---------------------+
The task: identify all pairs of bookings that conflict, meaning they are in the same room and their time ranges overlap.
SELECT
a.booking_id AS booking_1,
b.booking_id AS booking_2,
a.room_id
FROM meeting_rooms a
JOIN meeting_rooms b
ON a.room_id = b.room_id
AND a.booking_id < b.booking_id
AND a.start_time < b.end_time
AND b.start_time < a.end_time;
This self-join pairs every two bookings that share the same room. The condition a.booking_id < b.booking_id is crucial for preventing duplicate pairs (e.g., A-B and B-A) and avoiding a booking being paired with itself. The final two conditions embody the interval overlap logic: booking A starts before booking B ends, AND booking B starts before booking A ends.
For room A, bookings 1 and 2 overlap (9:00–10:00 and 9:30–10:30). Booking 3 (starting at 11:00) does not conflict with either. For room B, bookings 4 and 5 overlap (9:00–10:30 and 10:00–11:00).
Common errors include incorrectly formulating the overlap condition (e.g., using a_start BETWEEN b_start AND b_end, which misses scenarios where A fully encompasses B), forgetting the a.booking_id < b.booking_id clause leading to redundant or self-matching pairs, and mistakenly using <= instead of < if the business rule states that a meeting ending precisely when another begins does not constitute a conflict.
To practice this pattern on realistic problems, consider User Concurrent Sessions and Merge Overlapping Events in the Same Hall, both excellent exercises in interval overlap (though they may require a paid subscription).
12. Running Totals / Cumulative Metrics
This pattern involves progressively accumulating values as you traverse an ordered sequence of data. Unlike aggregation, which condenses rows into buckets, or sequential analysis, which compares neighbors, running totals maintain a growing sum that updates with each successive row.
Signals include phrases like "running total," "cumulative," "so far," "up to this point," or "as of each date." Sometimes, the problem doesn't explicitly use these terms but describes a row-by-row threshold check, which is a running total in disguise.
The fundamental approach is SUM() OVER (ORDER BY ...). The ORDER BY clause within the window function dictates the sequence of accumulation, with the SUM progressively adding values. If the accumulation needs to reset for different categories, you'd add a PARTITION BY clause.
A clear example is Last Person to Fit in the Bus.
The problem asks to find the last person who can board a bus with a 1000 kg weight limit, with boarding occurring in the order of their turn column. While "running total" isn't explicitly mentioned, this is precisely what's required: you accumulate weight person by person, halting when the limit is reached.
WITH cumulative AS (
SELECT
person_name,
SUM(weight) OVER (ORDER BY turn) AS total_weight
FROM Queue
)
SELECT person_name
FROM cumulative
WHERE total_weight <= 1000
ORDER BY total_weight DESC
LIMIT 1;
The CTE computes a running total of weight, ordered by the turn column. The outer query then filters to include only rows where the total_weight remains within the 1000 kg limit and selects the final person in that sequence. Notably, there's no GROUP BY, as this isn't aggregation; each row retains its identity, augmented with a cumulative value.
Common mistakes include omitting the crucial ORDER BY within the window function (which makes the accumulation order undefined and the results unpredictable), confusing this pattern with GROUP BY (which collapses rows, whereas window functions preserve them), and incorrect use of PARTITION BY, leading to either too frequent resets or a complete lack of resets where they're needed.
For more practice, try Restaurant Growth, which requires a rolling 7-day average of restaurant spending. This utilizes the same SUM() OVER (ORDER BY ...) mechanism but with a RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW frame, adding a layer of temporal complexity to the basic running total.
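The core of that frame can be sketched as follows (MySQL syntax; a Customer table with visited_on and amount columns is assumed, and the problem's additional requirements, such as trimming the first six days of output, are omitted):

```sql
-- Collapse to one row per day first, then attach a 7-day RANGE frame
-- to the running sum so each row covers today plus the six prior days.
WITH daily AS (
  SELECT visited_on, SUM(amount) AS amount
  FROM Customer
  GROUP BY visited_on
)
SELECT
  visited_on,
  SUM(amount) OVER (
    ORDER BY visited_on
    RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW
  ) AS rolling_7day_total
FROM daily;
```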
Conclusion
Ultimately, the goal isn't to commit these twelve patterns to memory as isolated facts. Instead, it's about repeatedly engaging with problems until a mental "click" occurs, and you instinctively recognize a familiar structure. This is the essence of true SQL mastery: the blank editor isn't testing your recall of precise syntax, but rather your innate understanding of the logical path forward. When you know where you're going, the specific keywords and functions become mere tools to execute a well-defined plan.
While this article covers the foundational patterns, you'll naturally encounter more complex scenarios that combine these building blocks. For example, "Find employees earning above their department average" isn't a wholly new pattern; it's a window function (AVG() OVER (PARTITION BY department)) applied to each row, followed by a filter: the same window-function machinery as running totals, just with a different aggregate and no ORDER BY. Similarly, pivoting rows into columns is simply conditional aggregation using CASE WHEN inside SUM or COUNT. Once these core patterns are internalized, combining them into sophisticated queries becomes a natural extension of your problem-solving toolkit.
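That department-average example can be sketched like this (an employees table with name, department, and salary columns is assumed). One wrinkle worth knowing: the window result has to be wrapped in a CTE or subquery, because window functions can't appear directly in a WHERE clause:

```sql
WITH with_avg AS (
  SELECT
    name,
    salary,
    -- the same average is repeated on every row of the department
    AVG(salary) OVER (PARTITION BY department) AS dept_avg
  FROM employees
)
SELECT name, salary
FROM with_avg
WHERE salary > dept_avg;
```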
In an age where AI can handle the mechanics of query generation, your ability to *conceptualize* the solution—to identify the underlying patterns—is more valuable than ever. It's the skill that distinguishes a proficient data professional from an AI prompt engineer, ensuring you remain indispensable in the evolving landscape of data work.
| Pattern | Signal | Core Tool |
|---|---|---|
| Aggregation | "for each," "how many," "per user" | GROUP BY + HAVING |
| Conditional Aggregation | multiple metrics, status breakdown | CASE WHEN inside SUM/COUNT |
| Top N Per Group | "latest," "highest," "top 3 per" | DENSE_RANK() OVER (PARTITION BY) |
| Sequential Analysis | "previous," "next," "consecutive" | LAG / LEAD |
| Event Pairing | "start/stop," "duration," "session" | LEAD partitioned by entity |
| Gaps and Islands | "streak," "consecutive days," "in a row" | ROW_NUMBER subtraction trick |
| Deduplication | "duplicate," "keep one," "unique per" | ROW_NUMBER() WHERE rn = 1 |
| Existence / Anti-Existence | "without," "never," "has no" | LEFT JOIN IS NULL / NOT EXISTS |
| Self-Join | "same table reference," "reports to" | JOIN table to itself with aliases |
| Cartesian Expansion | "all combinations," "missing pairs" | CROSS JOIN + LEFT JOIN |
| Interval Overlap | "conflicting," "double booked" | a_start < b_end AND b_start < a_end |
| Running Totals | "cumulative," "so far," "up to" | SUM() OVER (ORDER BY) |
