Counting Consecutive Repeating Segments in URL Strings via Trino SQL 

0
30

Short Summary: Sometimes URL paths repeat by mistake because of tracking problems or redirect loops.. In this guide, you’ll learn how to find and count those repeats using Trino SQL. By using simple SQL tools like CTEs, arrays, and window functions, you can break a long link into smaller pieces and clean up the data without using complex regex. 

The Challenge 

Data analysts and SQL developers often run into “dirty” URL data where path segments repeat because of tracking issues, setup mistakes, or redirect loops. It may not look serious at first, but these repeats can affect web analytics and SEO reports. The main challenges in fixing this problem include: 

  • Finding parts of a link that are exactly the same and appear right next to each other (for example, /product/product/).
  • Correcting splitting strings by the ‘/’ character while maintaining the original sequence.
  • Processing millions of rows in Trino (formerly Presto), where heavy regex can slow things down.
  • Trino does not have a built-in feature to automatically find repeating parts in one column, so you have to build the logic yourself. 

The Solution 

Finding repeating segments in Trino comes down to writing clear SQL logic. The idea is simple: break the URL into smaller parts, look at them one by one, and check if any part repeats right after the other. To do this, we split the URL into arrays, turn those arrays into rows while keeping their order, and use window functions to compare each segment with the previous one. 

In many data workflows, Trino is used to process and clean data before it is stored in relational databases such as PostgreSQL or SQL Server. Once the cleaned data lands there, tools like dbForge Edge help developers work with the results more efficiently. 

For example, you can: 

  • Manage the destination databases where cleaned Trino data is stored, such as PostgreSQL, SQL Server, MySQL, or Oracle.
  • Use the SQL Editor to analyze the processed data and refine queries.
  • Generate sample URLs or test data when validating query logic. 

Step-by-Step Guide 

The following steps show how this logic works directly in Trino SQL. 

Step 1: Prepare the Data with a CTE 

First, we define the source data. A CTE in SQL lets you isolate the URL strings and prepare them for processing without affecting the underlying tables. 

WITH raw_data AS ( 
    SELECT 1 as id, 'example.com/shop/shop/item1' as url 
    UNION ALL 
    SELECT 2, 'website.org/home/about/about/contact' 

SELECT * FROM raw_data; 

Step 2: Split and Unnest URL Segments 

Use split to convert the URL into an array. Then use UNNEST with WITH ORDINALITY to expand it into rows while keeping the segment order. 

WITH split_segments AS ( 
    SELECT  
        id,  
        url,  
        segment,  
        pos 
    FROM raw_data 
    CROSS JOIN UNNEST(split(url, '/')) WITH ORDINALITY AS t(segment, pos) 

SELECT * FROM split_segments; 

Step 3: Identify Consecutive Repeats 

Use LAG() to compare each segment with the one before it. If they are the same, mark it as a repeat. During testing, some users may also use simple Python Script to create or check sample data. 

WITH checks AS ( 
    SELECT  
        id,  
        url,  
        segment, 
        pos, 
        CASE WHEN segment = LAG(segment) OVER (PARTITION BY id ORDER BY pos)  
             THEN 1 ELSE 0 END as is_repeat 
    FROM split_segments 

SELECT * FROM checks; 

Step 4: Aggregate and Final Count 

Finally, group by the ID and URL to count how many consecutive repetitions were found in each string. 

SELECT  
    id,  
    url,  
    SUM(is_repeat) as total_repeats 
FROM checks 
GROUP BY id, url 
HAVING SUM(is_repeat) > 0; 

With the query in place, let’s look at why this approach is useful in real-world data work. 

Key Benefits 

Applying this logic improves data quality and makes URL structure issues easier to detect. 

Benefit Description 
Data Accuracy Removes noise in web analytics caused by repeated URL paths. 
Performance Uses native Trino window functions designed for distributed processing. 
Flexibility Can be adjusted to detect three or more consecutive repeats by modifying the window logic. 

Overall, this approach improves accuracy, scales well in Trino, and stays flexible for more complex repeat patterns. 

Conclusion 

Finding repeating URL segments in Trino SQL comes down to using arrays and window functions. By splitting the URL, expanding the segments, and comparing them in order, you can clean large datasets with clear logic. Tools like dbForge Edge make this easier by giving you a reliable space to write, test, and refine your queries before running them at scale. 

FAQ 

Can this logic detect non-consecutive repeats? 

No. The LAG() function only checks the previous row. To detect non-consecutive repeats, use a COUNT() window function partitioned by the segment name. 

Does this work for different delimiters? 

Yes, simply change the delimiter in the split(url, ‘/’) function to your desired character, such as a hyphen or dot. 

Is this approach efficient for large tables? 

Yes. Trino is built for parallel processing. For best results, use a high-cardinality key (such as a unique ID) in your PARTITION BY clause to distribute the workload evenly.