
Data Cleansing with AI: Master Records and Duplicates Sorted

How to cleanse master data, find duplicates and normalise addresses with AI - with copy-paste prompts for clean, reliable datasets.

“Müller GmbH”, “Mueller GmbH”, “Müller G.m.b.H.”, “Fa. Müller” - four entries, one customer. Recognise this from your CRM? Over the years, every database accumulates dead records, duplicates, and inconsistent spellings. And at some point accounting calls because the same company has been sent three reminders - under three different names.

We recently examined the customer database of a wholesale client: of 12,000 records, around 1,800 (about 15%) were duplicates or faulty. That’s not an exception - it’s the rule. The good news: AI is remarkably good at untangling exactly this kind of chaos.

The Master Data Problem in Mid-Market Companies

Clean data is the foundation for functioning processes - and at the same time the most neglected task in the company. Typical problems:

  • Duplicates: The same customer created multiple times, spelled slightly differently
  • Inconsistent formats: Sometimes “St.”, sometimes “Street”, sometimes “Str.”
  • Missing values: Empty postal code fields, missing contact persons
  • Typos: “Hamubrg” instead of “Hamburg”, transposed digits

The result: Wrong analyses, duplicate mailings, embarrassing letters, and extra work in every department that uses this data.

AI as Your Data Cleansing Helper

Language models are excellent at recognising patterns and standardising unstructured entries. They recognise that “Müller GmbH” and “Mueller G.m.b.H.” are most likely the same company - something a rigid Excel rule can’t manage.

For smaller volumes of data, an AI tool with copy-paste is often enough. For large, recurring datasets, you combine the AI with a small script or automation - more on that later.
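
To give an idea of what such a small script could look like, here is a minimal sketch that reduces company names to a rough comparison key. The helper name `match_key` and the word lists are my own assumptions for illustration and far from complete. Because the key drops the legal form, it only produces candidates for review, not confirmed duplicates.

```python
import re

# Illustrative, incomplete word lists (assumptions, extend for your own data).
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})
IGNORED = {"gmbh", "ag", "ltd", "kg", "ohg", "ug", "fa", "firma"}


def match_key(name: str) -> str:
    """Reduce a company name to a rough key for spotting duplicate candidates."""
    text = name.lower().translate(UMLAUTS)
    # Remove dots inside abbreviations: "g.m.b.h." becomes "gmbh."
    text = re.sub(r"(?<=\w)\.(?=\w)", "", text)
    # Replace remaining punctuation with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = [token for token in text.split() if token not in IGNORED]
    return " ".join(tokens)


names = ["Müller GmbH", "Mueller GmbH", "Müller G.m.b.H.", "Fa. Müller"]
keys = {name: match_key(name) for name in names}
```

All four spellings from the introduction end up with the same key here. A "Müller AG" would too, which is exactly why the final decision stays with a human or goes to the AI as an uncertain case.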

How It Works in Practice

Step 1: Prepare a data excerpt (anonymised if necessary)

Step 2: Define the cleansing rules in the prompt

Step 3: Have the AI standardise and flag duplicates

Step 4: Check the result and feed it back into the system

My Proven Prompt Template for Data Cleansing

I use this prompt to normalise address and customer data:

You are my data cleansing assistant. Here is a list of records.

CLEAN ACCORDING TO THESE RULES:
- Company name: write the legal form consistently as "GmbH", "AG", "Ltd"
- Always spell out "Street" (not "St.")
- Check city names for correct spelling
- Correct obvious typos

Return the result as a table: Original | Cleaned | What was changed

Flag uncertain cases that I should check manually.
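
If you reuse this template regularly, it can help to assemble it from a rule list instead of editing the text by hand. The following is only a sketch: the function name `build_prompt` and the "RECORDS:" label are assumptions of mine, and the resulting string still has to be pasted into or sent to whichever AI tool you use.

```python
# Rules taken from the template above; adjust them to your own data.
RULES = [
    'Company name: write the legal form consistently as "GmbH", "AG", "Ltd"',
    'Always spell out "Street" (not "St.")',
    "Check city names for correct spelling",
    "Correct obvious typos",
]


def build_prompt(records: list[str], rules: list[str] = RULES) -> str:
    """Combine the cleansing rules and a block of records into one prompt."""
    rule_lines = "\n".join(f"- {rule}" for rule in rules)
    record_lines = "\n".join(
        f"{number}. {record}" for number, record in enumerate(records, start=1)
    )
    return (
        "You are my data cleansing assistant. Here is a list of records.\n\n"
        f"CLEAN ACCORDING TO THESE RULES:\n{rule_lines}\n\n"
        "Return the result as a table: Original | Cleaned | What was changed\n\n"
        "Flag uncertain cases that I should check manually.\n\n"
        f"RECORDS:\n{record_lines}"
    )


prompt = build_prompt(["Müller G.m.b.H., Hauptst. 12, Hamubrg"])
```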

Example: Find Duplicates

Here is a list of company names with addresses.
Find likely duplicates - even with different spellings.
Group related entries and provide a recommended "master record" for each group.
Briefly explain why you consider the entries identical.
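
For larger lists, a rough pre-filter can narrow down which entries are worth showing to the AI at all. This sketch uses `difflib.SequenceMatcher` from the Python standard library. The threshold of 0.85 and the function names are assumptions, and the simple grouping compares each name only with the first entry of a group, so treat the output as candidates, not as a verdict.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Return True if two names are similar enough to be duplicate candidates."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def group_candidates(names: list[str], threshold: float = 0.85) -> list[list[str]]:
    """Greedily group names that resemble the first entry of an existing group."""
    groups: list[list[str]] = []
    for name in names:
        for group in groups:
            if similar(name, group[0], threshold):
                group.append(name)
                break
        else:
            # No existing group matched, so this name starts a new one.
            groups.append([name])
    return groups


groups = group_candidates(["Müller GmbH", "Mueller GmbH", "Schmidt AG"])
```

Groups with more than one entry are the ones to paste into the prompt above.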

Example: Normalise Address Data

Break these unformatted addresses into clean fields:
Street, house number, postal code, city, country.
If information is missing or unclear, mark the field as "CHECK".
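
Addresses that already follow a regular pattern can be split without AI. Here is a minimal sketch, assuming the German-style layout "Street Number, Postal code City, Country" with a 5-digit postal code; the pattern and the function name `split_address` are my own illustration. Anything that does not fit is marked "CHECK", just like in the prompt, and is a good candidate for the AI.

```python
import re

# Assumed layout: "<street> <house number>, <5-digit postal code> <city>[, <country>]"
ADDRESS = re.compile(
    r"^\s*(?P<street>.+?)\s+(?P<house_number>\d+\s?[a-zA-Z]?)\s*,\s*"
    r"(?P<postal_code>\d{5})\s+(?P<city>[^,]+?)\s*(?:,\s*(?P<country>.+?))?\s*$"
)
FIELDS = ("street", "house_number", "postal_code", "city", "country")


def split_address(raw: str) -> dict[str, str]:
    """Split one address line into fields; unclear or missing parts become CHECK."""
    match = ADDRESS.match(raw)
    if match is None:
        return {field: "CHECK" for field in FIELDS}
    return {field: (match.group(field) or "CHECK") for field in FIELDS}


parsed = split_address("Hauptstraße 12a, 20095 Hamburg, Germany")
```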

Example: Validate Data

Check this list for plausibility:
- Are the postal codes valid (5 digits for German addresses)?
- Do postal code and city match?
- Are the email addresses formally correct?
List only the faulty entries with the respective problem.
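
The purely formal checks in this prompt can also run as a script before the AI sees the data. A small sketch, assuming records are dictionaries with the hypothetical keys `postal_code` and `email`. The email pattern is deliberately loose, and checking whether postal code and city match would need a reference table, which is not part of this example.

```python
import re

POSTAL_CODE = re.compile(r"\d{5}")  # 5 digits, as used for German postal codes
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # loose formal check only


def find_problems(record: dict[str, str]) -> list[str]:
    """Return a list of formal problems found in one record."""
    problems = []
    if not POSTAL_CODE.fullmatch(record.get("postal_code", "")):
        problems.append("postal code is not 5 digits")
    if not EMAIL.fullmatch(record.get("email", "")):
        problems.append("email address is not formally valid")
    return problems


faulty = find_problems({"postal_code": "2009", "email": "info@example"})
```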

What You Should Watch Out For

Never overwrite unchecked: AI makes suggestions - the final decision whether two records are really identical is yours. Always work with a copy, never directly in the production system.

Mind data protection: Customer data is personal data. For larger or sensitive datasets, only use GDPR-compliant tools with a data processing agreement - or keep the processing in your own environment.

Work in chunks: Don’t give the AI 10,000 rows at once. Work in blocks of 50-100 records; this reduces errors and makes checking easier.
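
Splitting an export into blocks is a few lines of code. A minimal sketch with a hypothetical helper `chunks` and a block size of 50:

```python
from typing import Iterator, Sequence


def chunks(records: Sequence[str], size: int = 50) -> Iterator[Sequence[str]]:
    """Yield consecutive blocks of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


records = [f"record {number}" for number in range(120)]
block_sizes = [len(block) for block in chunks(records, size=50)]
```

Each block then goes to the AI separately and gets checked on its own.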

Document changes: Always request the output in the form “Original | Cleaned | Change”. That way you can trace every correction and roll back if in doubt.
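
If you process the data with a script, the same log can be written as a CSV file. This is only a sketch using the standard `csv` module; the column names and the function `write_change_log` are assumptions, and the example writes to an in-memory buffer instead of a real file.

```python
import csv
import io


def write_change_log(pairs, handle) -> None:
    """Write (original, cleaned) pairs with a changed flag as CSV rows."""
    writer = csv.writer(handle)
    writer.writerow(["original", "cleaned", "changed"])
    for original, cleaned in pairs:
        changed = "yes" if original != cleaned else "no"
        writer.writerow([original, cleaned, changed])


buffer = io.StringIO()
write_change_log([("Hamubrg", "Hamburg"), ("Berlin", "Berlin")], buffer)
print(buffer.getvalue())
```

Because the original value stays in the log, every correction can be undone later.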

Have edge cases flagged: Real data quality emerges when the AI honestly says: “This is uncertain.” Demand that explicitly.

Checklist: Data Cleansing with AI

  • Always work with a copy, never in the production system
  • Choose a GDPR-compliant tool
  • Define cleansing rules clearly in the prompt
  • Work in blocks of 50-100 records
  • Request output as “Original | Cleaned | Change”
  • Check uncertain cases manually
  • Feed the cleaned data back in a controlled way

Quick Win for Today

Try this in 5 minutes: Export 50 rows from your CRM or customer list (anonymised if necessary). Copy the duplicate prompt above and let the AI search for duplicate entries. I promise you: you’ll find at least one duplicate you didn’t know about.

When Data Maintenance Becomes a Permanent Topic

A one-off cleansing with AI is a good start. But data quality isn’t a project - it’s a state, and it only lasts if new data doesn’t get dirty in the first place. This is exactly where system integration comes in: when CRM, ERP, and web shop are cleanly connected, data is entered only once and kept consistent everywhere. Most duplicates then never arise in the first place.

We help mid-market companies connect their systems so that data stays automatically synchronised and clean. Take a look at our system integration or let’s talk about your dataset.

Try the quick win above first - in the next newsletter we’ll cover how to automatically categorise incoming inquiries and tickets with AI.

Best regards
Dennis

Dennis Pfeifer
Founder & IT Consultant