Short Summary: Sometimes URL paths repeat by mistake because of tracking problems or redirect loops.. In this guide, you’ll learn how to find and count those repeats using Trino SQL. By using simple SQL tools like CTEs, arrays, and window functions, you can break a long link into smaller pieces and clean up the data without using complex regex.
The Challenge
Data analysts and SQL developers often run into “dirty” URL data where path segments repeat because of tracking issues, setup mistakes, or redirect loops. It may not look serious at first, but these repeats can affect web analytics and SEO reports. The main challenges in fixing this problem include:
- Finding parts of a link that are exactly the same and appear right next to each other (for example,
/product/product/). - Correcting splitting strings by the ‘/’ character while maintaining the original sequence.
- Processing millions of rows in Trino (formerly Presto), where heavy regex can slow things down.
- Trino does not have a built-in feature to automatically find repeating parts in one column, so you have to build the logic yourself.
The Solution
Finding repeating segments in Trino comes down to writing clear SQL logic. The idea is simple: break the URL into smaller parts, look at them one by one, and check if any part repeats right after the other. To do this, we split the URL into arrays, turn those arrays into rows while keeping their order, and use window functions to compare each segment with the previous one.
In many data workflows, Trino is used to process and clean data before it is stored in relational databases such as PostgreSQL or SQL Server. Once the cleaned data lands there, tools like dbForge Edge help developers work with the results more efficiently.
For example, you can:
- Manage the destination databases where cleaned Trino data is stored, such as PostgreSQL, SQL Server, MySQL, or Oracle.
- Use the SQL Editor to analyze the processed data and refine queries.
- Generate sample URLs or test data when validating query logic.
Step-by-Step Guide
The following steps show how this logic works directly in Trino SQL.
Step 1: Prepare the Data with a CTE
First, we define the source data. A CTE in SQL lets you isolate the URL strings and prepare them for processing without affecting the underlying tables.
WITH raw_data AS (
SELECT 1 as id, 'example.com/shop/shop/item1' as url
UNION ALL
SELECT 2, 'website.org/home/about/about/contact'
)
SELECT * FROM raw_data;
Step 2: Split and Unnest URL Segments
Use split to convert the URL into an array. Then use UNNEST with WITH ORDINALITY to expand it into rows while keeping the segment order.
WITH split_segments AS (
SELECT
id,
url,
segment,
pos
FROM raw_data
CROSS JOIN UNNEST(split(url, '/')) WITH ORDINALITY AS t(segment, pos)
)
SELECT * FROM split_segments;
Step 3: Identify Consecutive Repeats
Use LAG() to compare each segment with the one before it. If they are the same, mark it as a repeat. During testing, some users may also use simple Python Script to create or check sample data.
WITH checks AS (
SELECT
id,
url,
segment,
pos,
CASE WHEN segment = LAG(segment) OVER (PARTITION BY id ORDER BY pos)
THEN 1 ELSE 0 END as is_repeat
FROM split_segments
)
SELECT * FROM checks;
Step 4: Aggregate and Final Count
Finally, group by the ID and URL to count how many consecutive repetitions were found in each string.
SELECT
id,
url,
SUM(is_repeat) as total_repeats
FROM checks
GROUP BY id, url
HAVING SUM(is_repeat) > 0;
With the query in place, let’s look at why this approach is useful in real-world data work.
Key Benefits
Applying this logic improves data quality and makes URL structure issues easier to detect.
| Benefit | Description |
|---|---|
| Data Accuracy | Removes noise in web analytics caused by repeated URL paths. |
| Performance | Uses native Trino window functions designed for distributed processing. |
| Flexibility | Can be adjusted to detect three or more consecutive repeats by modifying the window logic. |
Overall, this approach improves accuracy, scales well in Trino, and stays flexible for more complex repeat patterns.
Conclusion
Finding repeating URL segments in Trino SQL comes down to using arrays and window functions. By splitting the URL, expanding the segments, and comparing them in order, you can clean large datasets with clear logic. Tools like dbForge Edge make this easier by giving you a reliable space to write, test, and refine your queries before running them at scale.
FAQ
Can this logic detect non-consecutive repeats?
No. The LAG() function only checks the previous row. To detect non-consecutive repeats, use a COUNT() window function partitioned by the segment name.
Does this work for different delimiters?
Yes, simply change the delimiter in the split(url, ‘/’) function to your desired character, such as a hyphen or dot.
Is this approach efficient for large tables?
Yes. Trino is built for parallel processing. For best results, use a high-cardinality key (such as a unique ID) in your PARTITION BY clause to distribute the workload evenly.

