Difficulty: intermediate

Remove Duplicates

Remove duplicate values from a list.

Duplicate removal is essential for data cleaning and normalization. This exercise teaches you about hash-based lookups, order preservation, and the trade-offs between different data structures.

📚 Concepts & Theory

Removing duplicates from a list while preserving order is a common data cleaning task. Python offers multiple approaches, each with different trade-offs regarding performance and order preservation.

Understanding Uniqueness in Lists:

Method 1: Using set() (Fastest, Order Not Preserved)

numbers = [1, 2, 2, 3, 4, 3, 5]
unique = list(set(numbers)) # [1, 2, 3, 4, 5] - order may vary

Method 2: dict.fromkeys() (Order Preserved in Python 3.7+)

numbers = [1, 2, 2, 3, 4, 3, 5]
unique = list(dict.fromkeys(numbers)) # [1, 2, 3, 4, 5]

Method 3: Manual Loop with Tracking (Order Preserved)

numbers = [1, 2, 2, 3, 4, 3, 5]
unique = []
seen = set()
for num in numbers:
    if num not in seen:
        unique.append(num)
        seen.add(num)

Method 4: List Comprehension with Tracking

numbers = [1, 2, 2, 3, 4, 3, 5]
seen = set()
# seen.add(x) returns None (falsy), so the condition is True only for
# the first occurrence of x; the add happens as a side effect.
unique = [x for x in numbers if not (x in seen or seen.add(x))]

Performance Comparison:

  • set(): O(n) time, but never guarantees order (sets are unordered in every Python version)

  • dict.fromkeys(): O(n) time, preserves insertion order (Python 3.7+)

  • Manual with set tracking: O(n) time, O(n) space, explicit control

  • List comprehension: O(n) time, more compact but less readable

When to Use Each:

  • Order doesn't matter → list(set(data))

  • Order matters → list(dict.fromkeys(data))

  • Need custom logic → manual loop with set tracking
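The trade-offs above can be checked empirically. Here is a rough timing sketch using the standard-library timeit module (the data size and iteration count are arbitrary choices; absolute numbers will vary by machine):

```python
import timeit

data = list(range(1000)) * 10  # 10,000 items, 1,000 unique

def with_set():
    return list(set(data))

def with_fromkeys():
    return list(dict.fromkeys(data))

def with_loop():
    seen = set()
    result = []
    for x in data:
        if x not in seen:
            seen.add(x)
            result.append(x)
    return result

for fn in (with_set, with_fromkeys, with_loop):
    # Run each approach many times and report the total elapsed time
    t = timeit.timeit(fn, number=200)
    print(f"{fn.__name__}: {t:.3f}s")
```

Typically set() is fastest, dict.fromkeys() is close behind, and the manual loop is slowest, though all three are O(n).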

🎯 Your Challenge

Write a function remove_duplicates(items) that removes duplicate values while preserving the order of first occurrence.

📝 Starter Code

Python
def remove_duplicates(items):
    pass

print(remove_duplicates([1, 2, 2, 3]))

💡 Hints

  • Use a set to track elements you've already seen
  • Sets have O(1) lookup time for membership testing
  • Build a new list containing only first occurrences
  • Check each element against your seen set
  • Add elements to both the result list and seen set

Solution

Python
def remove_duplicates(items):
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

print(remove_duplicates([1, 2, 2, 3]))

Explanation

This solution uses a set to track elements we've already seen (O(1) average lookup time) and builds a new list containing only the first occurrence of each element, in original order. For each element, we check whether it is in the seen set; if not, we add it to both the result list and the seen set. Using a set for membership testing is crucial for O(n) overall complexity; using a list would make it O(n²), because each membership test on a list scans it linearly. Time complexity: O(n). Space complexity: O(n) for the result list and the seen set.
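To make the complexity difference concrete, here is a sketch contrasting a list-based seen check (O(n) per lookup, O(n²) overall) with the set-based version from the solution:

```python
def remove_duplicates_slow(items):
    # 'seen' is a list: each 'in' check scans it linearly -> O(n^2) total
    seen = []
    result = []
    for item in items:
        if item not in seen:  # O(n) membership test on a list
            seen.append(item)
            result.append(item)
    return result

def remove_duplicates_fast(items):
    # 'seen' is a set: each 'in' check is O(1) on average -> O(n) total
    seen = set()
    result = []
    for item in items:
        if item not in seen:  # O(1) average membership test on a set
            seen.add(item)
            result.append(item)
    return result

data = [1, 2, 2, 3, 4, 3, 5]
print(remove_duplicates_slow(data))  # [1, 2, 3, 4, 5]
print(remove_duplicates_fast(data))  # [1, 2, 3, 4, 5]
```

Both produce identical output; the difference only shows up as input size grows.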

⚠️ Common Mistakes to Avoid

  • Using list instead of set for tracking (O(n²) complexity)
  • Not preserving original order when required
  • Assuming set() preserves order (it never guarantees order; it is dict, not set, that preserves insertion order in Python 3.7+)
  • Modifying list while iterating over it
  • Not handling unhashable types (like lists of lists)
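The last pitfall deserves a sketch. Inner lists are unhashable and cannot go into a set, so one workaround (assuming the inner lists are flat and should compare by value) is to track a hashable tuple snapshot of each one:

```python
def remove_duplicate_lists(items):
    # Inner lists can't be set members; use a tuple as a hashable stand-in.
    seen = set()
    result = []
    for item in items:
        key = tuple(item)  # fails if the inner list is itself nested
        if key not in seen:
            seen.add(key)
            result.append(item)
    return result

data = [[1, 2], [3, 4], [1, 2]]
print(remove_duplicate_lists(data))  # [[1, 2], [3, 4]]
```

Note this only handles one level of nesting; a deeply nested structure would need a recursive conversion to tuples.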

❓ Frequently Asked Questions

Q: Does dict.fromkeys() always preserve order?
A: In Python 3.7+, insertion order is preserved. In earlier versions, order is lost.

Q: What if my list contains unhashable elements?
A: You'll need a different approach, possibly converting to tuples or using a custom comparison.

Q: Which method is fastest?
A: list(set(data)) is fastest but may not preserve order. dict.fromkeys() is fast and preserves order.

Q: How do I remove duplicate dictionaries from a list?
A: You need to define what makes dictionaries 'equal' and use custom logic, possibly with tuple conversion.
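The dictionary case from the last answer can be sketched like this, assuming two dicts count as equal when they have the same key-value pairs and all values are hashable:

```python
def remove_duplicate_dicts(dicts):
    # frozenset(d.items()) is hashable and ignores key order,
    # so it serves as an identity key for each dict.
    seen = set()
    result = []
    for d in dicts:
        key = frozenset(d.items())
        if key not in seen:
            seen.add(key)
            result.append(d)
    return result

records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 1, "name": "a"}]
print(remove_duplicate_dicts(records))
```

If the dicts contain unhashable values (lists, other dicts), the key would need a deeper conversion, e.g. serializing each dict to a canonical string.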

🔗 Related Exercises