How to Clean Messy Data in Excel: Step-by-Step Guide

Coding Liquids blog cover featuring Sagnik Bhattacharya for How to Clean Messy Data in Excel Fast, with messy spreadsheet elements transforming into organised rows.
Coding Liquids blog cover featuring Sagnik Bhattacharya for How to Clean Messy Data in Excel Fast, with messy spreadsheet elements transforming into organised rows.

Every data analyst knows the reality: 80% of the work is cleaning data, 20% is actual analysis. Messy data — duplicates, blanks, inconsistent formatting, extra spaces, mixed cases, numbers stored as text, stray non-printable characters — can silently derail any report. A VLOOKUP that returns #N/A, a SUMIFS that undercounts by half, a pivot table with "Kolkata" and "kolkata " showing as different cities: all of these trace back to data quality, not formulas.

Coming Soon

Complete Excel Guide with AI Integration

Master formulas, pivot tables, data analysis, and charts — with AI integration.

Learn more

This tutorial walks through the ten techniques that handle roughly 95% of real-world cleaning tasks, using only Excel's built-in tools — no add-ins, no Power Query (though Power Query is what you graduate to once these techniques become muscle memory). Every section includes a real-world scenario taken from freelance client work and a beginner pitfall that will save you an hour of backtracking the first time you hit it.

Follow me on Instagram@sagnikteaches

Work on a copy of your data, not the original. Press Ctrl+Shift+End to confirm how far your data extends before you run anything destructive, and keep the row count visible on the status bar so you notice when Remove Duplicates or a delete strips more than you expected. If you are new to Excel, start with the Excel formulas guide — the techniques below assume you are comfortable with relative vs absolute references and the basic structure of an Excel worksheet.

Connect on LinkedInSagnik Bhattacharya Subscribe on YouTube@codingliquids

1. Remove Duplicates

The quickest win. Select your data, go to Data → Remove Duplicates. Choose which columns to check for duplicates. Excel will tell you how many duplicates were found and removed.

Real-world scenario: A marketing agency exports leads from three landing pages into one sheet. The same person has signed up twice using priya@company.com and PRIYA@company.com. Ticking the Email column alone still leaves both rows because Excel treats the two strings as different. The fix is to add a helper column =LOWER(TRIM(B2)), tick only that helper for Remove Duplicates, then delete the helper — you collapse the real duplicates without losing the original casing on the rows that survive.

Beginner pitfall: Remove Duplicates keeps the first occurrence and deletes the rest, which is usually fine — but if the later rows contain updated data (a newer phone number, a corrected postcode), you silently lose that update. Always sort by date descending first, so the freshest row is the one Excel keeps.

Pro tip: Before removing, use conditional formatting to highlight duplicates first (Home → Conditional Formatting → Highlight Cell Rules → Duplicate Values). Review them before deleting — sometimes duplicates are legitimate (two separate orders from the same customer on the same day is not the same as a data-entry error).

2. Fix Extra Spaces

Hidden spaces are the #1 cause of failed VLOOKUP formulas. The TRIM function removes all leading, trailing, and double spaces (it keeps single spaces between words, so "John Smith" with two spaces becomes "John Smith"):

=TRIM(A2)

For non-breaking spaces (CHAR(160), common in data copy-pasted from web pages and PDFs), TRIM alone does nothing because Excel treats CHAR(160) as a real character, not a space. Wrap it in SUBSTITUTE first:

=TRIM(SUBSTITUTE(A2, CHAR(160), " "))

And for fully non-printable garbage (line breaks inside a cell, Ctrl characters from legacy systems), stack CLEAN on top:

=TRIM(CLEAN(SUBSTITUTE(A2, CHAR(160), " ")))

Real-world scenario: A finance team reconciles invoice numbers from the accounting system against a CSV from the bank. The VLOOKUP returns #N/A for about 30 rows. Visual inspection shows nothing wrong. A quick =LEN(A2) on both sides reveals the bank export has a trailing space on every value — invisible, but enough to break equality. Wrapping both lookup columns in TRIM (then pasting as values) fixes every row in one pass.

Beginner pitfall: Running TRIM in a helper column and then deleting the original will blow up the helper's formulas because =TRIM(A2) now points at a deleted cell. The correct order is: TRIM in helper → copy helper → paste special → values into the original → delete helper.

3. Standardise Text Case

"john smith", "JOHN SMITH", and "John Smith" are three different values to Excel. Any downstream COUNTIFS, pivot table, or dedupe will treat them as three distinct rows. Fix it with:

  • =UPPER(A2) — JOHN SMITH
  • =LOWER(A2) — john smith
  • =PROPER(A2) — John Smith

Real-world scenario: An HR team imports employee names from three legacy HRIS systems, each with its own casing convention. Their headcount pivot shows 847 employees; the real number is 812. PROPER across the name column collapses the inflated count back to the truth — 35 ghost rows were just case variations of existing employees.

Beginner pitfall: PROPER is blind to context. It capitalises "mcDonald" as "Mcdonald", "O'Brien" as "O'brien", and acronyms like "NASA" as "Nasa". For a column that contains surnames, emails, or anything with acronyms, LOWER (for emails) or a manual pass with Find & Replace for the known exceptions is safer than trusting PROPER blindly.

4. Find and Replace

Ctrl+H opens Find and Replace. Powerful uses:

  • Replace "N/A", "-", "NULL" with blank (leave the "Replace with" field empty)
  • Replace old product codes with new ones in bulk after a SKU migration
  • Remove specific characters (search for the character, leave "Replace with" empty)
  • Use wildcards: * matches any characters, ? matches one character
  • Click Options to restrict to "Match case" or "Match entire cell contents" — essential when "IT" (department) must not rewrite every "it" in a description column

Real-world scenario: A logistics team has 40,000 rows of addresses where the state column contains every variant of "Maharashtra": "MH", "maharashtra", "Mahrashtra", "Maha.". One pass of Find & Replace for each known variant → one clean value. Total time: under two minutes. Doing the same thing row by row would take the rest of the afternoon.

Beginner pitfall: Leaving Find & Replace set to "workbook" scope will silently edit cells on every sheet, including reference tables you did not want touched. Before clicking Replace All, check the Within dropdown reads "Sheet", not "Workbook", and select the column first so the replace is scoped to just that column.

5. Split Data with Text to Columns

Got "First Last" in one column? Select the column → Data → Text to Columns:

  • Delimited — Split by comma, space, tab, or custom delimiter
  • Fixed width — Split at specific character positions (useful for bank statements and legacy mainframe exports)

In Microsoft 365, use the TEXTSPLIT function instead for more flexibility — it spills into a dynamic array and updates automatically when the source changes:

=TEXTSPLIT(A2, " ")
=TEXTSPLIT(A2, {",", ";"})

Real-world scenario: A CRM export dumps "Name | Email | Phone" into a single cell per lead. One pass of Text to Columns with | as a custom delimiter separates 8,000 records into three clean columns. Before pressing Finish, always set the destination cell to an empty column — otherwise Text to Columns overwrites the column to the right of your source without warning.

Beginner pitfall: Text to Columns remembers its last-used settings for the rest of the Excel session. If your previous split was on commas, and you now paste data into a cell, Excel will silently split that paste on commas too. Close and reopen Excel (or run Text to Columns once on a dummy cell with Tab as the delimiter) to reset the behaviour.

6. Handle Blank Cells

To find and fill blanks:

  1. Select your range
  2. Press Ctrl+G → Special → Blanks
  3. All blank cells are now selected
  4. Type a value (e.g. 0 or Unknown) or a formula (=A2 to copy from above) and press Ctrl+Enter to fill all at once

Real-world scenario: A pivot table from an accounting export shows "(blank)" as one of the row labels and throws off the totals. The source data has blank Region cells for rows where the exporter skipped the merged header. Ctrl+G → Blanks → =A2 → Ctrl+Enter fills each blank with the Region from the row above. Copy → Paste Special → Values, and the pivot cleans up in one refresh.

Beginner pitfall: Cells that look blank but contain an empty string ("") returned by a formula are not "blank" to Go To Special. =LEN(A2) will return 0 for both, but ISBLANK(A2) will return FALSE for the formula-blank. If Go To Special finds zero blanks when you can clearly see some, replace "" with an actual blank via Find & Replace (search for "", replace with nothing, or better: fix the upstream formula to return NA() or a sentinel like "-" instead).

7. Fix Numbers Stored as Text

If you see numbers left-aligned with a green triangle, they're stored as text. This is the single most common reason a SUM returns 0 when the cells clearly contain numbers. Fix it with any of these:

  • Select the cells → click the warning icon → "Convert to Number". To prevent this issue entirely, set up data validation rules on your input columns so the user cannot type text into a numeric field
  • Or: create a helper column with =A2*1 or =VALUE(A2), then paste-as-values back
  • Or: select all → Data → Text to Columns → Finish (without changing settings) — this is the fastest bulk fix for thousands of rows
  • Or: type 1 in an empty cell, copy it, select your text-numbers, Paste Special → Multiply. Every cell is now a real number in place.

Real-world scenario: A sales manager imports daily sales from a bank CSV. =SUM(D:D) returns 0. Every number in column D has a leading apostrophe from the CSV parser (this forces text storage). Text to Columns → Finish converts the entire column to real numbers in one click, and the sum updates instantly.

Beginner pitfall: Numbers with thousands separators ("1,234,567") or currency symbols ("£1,234") imported from web data will resist =VALUE and *1 because Excel cannot coerce them. Use =NUMBERVALUE(A2, ".", ",") in Microsoft 365, or strip the separators first with =VALUE(SUBSTITUTE(SUBSTITUTE(A2, ",", ""), "£", "")).

8. Extract Data with Formulas

Common extraction tasks:

  • Extract domain from email: =MID(A2, FIND("@",A2)+1, 100). For more formula techniques, see the advanced formulas guide
  • Extract first name: =LEFT(A2, FIND(" ",A2)-1)
  • Extract last name (single-space names): =MID(A2, FIND(" ",A2)+1, 100)
  • Extract numbers from text: =TEXTBEFORE(A2, " ") or =TEXTAFTER(A2, "-") in Microsoft 365; a combination of MID and SEQUENCE for older versions
  • Postcode from a UK address: =TEXTAFTER(A2, ", ", -1) (the -1 takes the last occurrence)

Real-world scenario: A recruiter has 2,400 LinkedIn profile URLs in one column and needs the username for a mail-merge. =TEXTAFTER(A2, "linkedin.com/in/") returns every handle clean in one spill. The same task with old-school LEFT/MID/FIND would take three nested formulas and a lot of off-by-one debugging.

Beginner pitfall: FIND is case-sensitive and errors on a missing match (#VALUE!); SEARCH is case-insensitive and also errors on a missing match. Wrap either in IFERROR(..., "") so a single missing "@" in the email column does not cascade to every downstream formula. For extractions you will run repeatedly, reach for Flash Fill (next section) or Power Query instead — they handle exceptions gracefully where formulas crash.

9. Use Flash Fill

Excel's Flash Fill (Ctrl+E) is magic for pattern-based cleaning. Type the desired result for 2-3 rows in the column next to your source, then press Ctrl+E — Excel detects the pattern and fills the rest. For recurring data cleaning tasks, Power Query can automate the entire process. Flash Fill works for:

  • Name reformatting ("Smith, John" → "John Smith")
  • Extracting parts of strings (pulling "Pune" out of "Pune, Maharashtra, 411001")
  • Combining data from multiple columns ("Priya" + "Sharma" → "Priya Sharma")
  • Reformatting phone numbers and dates (9876543210+91 98765 43210)
  • Masking sensitive data (priya@company.comp****@company.com)

Real-world scenario: A customer-support team needs to reformat 5,000 product SKUs from SKU-2024-AB-001 to AB001-2024. Type the first two outputs by hand, press Ctrl+E, and Excel works out the pattern and fills the remaining 4,998 rows in under a second. A formula solution would need nested TEXTAFTER + TEXTBEFORE and still break on any SKU that has a different number of hyphens.

Beginner pitfall: Flash Fill is not a live formula — it is a one-time transformation. If new rows arrive, Flash Fill does not extend to them automatically. Either re-run Ctrl+E on the new rows, or replace Flash Fill with a formula/Power Query for any workflow that refreshes. Flash Fill also fails silently on inconsistent patterns; always spot-check rows 50, 500, and 5,000 before trusting the output.

10. Data Cleaning Checklist

Run this checklist top-to-bottom on every dataset before you start any analysis. Most beginners skip it and spend the rest of the day wondering why their pivot totals do not match the finance team's numbers.

CheckTool/Method
DuplicatesRemove Duplicates / Conditional Formatting
Extra spacesTRIM function (and TRIM+CLEAN+SUBSTITUTE for web data)
Inconsistent caseUPPER/LOWER/PROPER (watch acronyms)
Blank cellsGo To Special → Blanks (then Ctrl+Enter to bulk-fill)
Numbers as textText to Columns / VALUE / Paste Special Multiply by 1
Date formatsFormat Cells → Date; DATEVALUE for text-dates
Spelling errorsFind & Replace with wildcards; F7 for Spell Check
Column splittingText to Columns / Flash Fill / TEXTSPLIT
Row count sanityStatus bar count vs expected (catches silent deletes)
Final formatConvert range to an Excel Table so new rows auto-inherit formats and formulas

Real-world scenario: A freelance analyst builds a reusable cleaning tab in every client workbook: a column for raw data, a column for =TRIM(CLEAN(SUBSTITUTE(A2,CHAR(160)," "))), then a column with =PROPER(...). New data pastes into column A; columns B and C auto-spill clean output via the Table's calculated columns. The cleaning step becomes a 5-second operation instead of a 30-minute ritual.

Beginner pitfall: "Finished cleaning" is a false summit. Before moving on, run =COUNTA(A:A)-COUNTBLANK(A:A), =SUMPRODUCT(1/COUNTIF(A2:A1000, A2:A1000)) (unique count), and =MIN(A:A)/=MAX(A:A) on numeric columns. If any of those look wrong, you still have messy data — the cleaning is not done yet.

Troubleshooting: Common Data-Cleaning Errors

These are the five errors that trip up almost every beginner the first time they clean real data.

TRIM did nothing — the spaces are still there

The "spaces" are not real spaces — they are non-breaking spaces (CHAR(160)) from a web copy-paste, or zero-width characters from a legacy export. TRIM only handles ASCII space (CHAR(32)). Use =TRIM(CLEAN(SUBSTITUTE(SUBSTITUTE(A2, CHAR(160), " "), CHAR(8203), ""))) to hit the common culprits.

Remove Duplicates says "0 removed" but I can see duplicates

Excel is comparing the entire row — one column has a trailing space, a different case, or an invisible character, so the two rows are not byte-for-byte identical. Add a helper column with =LOWER(TRIM(B2&"|"&C2&"|"&D2)) and tick only that helper column when running Remove Duplicates.

Flash Fill produced the wrong pattern

Your first 2-3 example rows were ambiguous — Flash Fill locked onto a pattern you did not intend. Undo (Ctrl+Z), add a fourth or fifth example row that disambiguates (especially rows that are edge cases — very short names, double-barrelled surnames, missing middle initials), then press Ctrl+E again.

Text to Columns destroyed the column to the right

Text to Columns has no "insert columns" option — it overwrites the columns to the right of your source. Always insert enough empty columns before running it, or change the Destination field on the final wizard step to a cell in an empty area of the sheet.

Numbers-as-text survived =VALUE(A2)

The "numbers" contain non-numeric characters — thousand separators, currency symbols, trailing spaces, or even a non-breaking space before the digit. Strip them first: =VALUE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2,",",""),"£",""),CHAR(160),"")). For a one-shot bulk fix, Data → Text to Columns → Finish usually handles all of these in one step.

FAQ

How do I clean messy data in Excel?

Work on a copy of the data, never the original. Run this sequence top-to-bottom: (1) TRIM to remove extra spaces and CLEAN for non-printable characters, (2) PROPER/UPPER/LOWER for consistent case, (3) Find & Replace for known variants and typos, (4) Text to Columns or TEXTSPLIT to split combined fields, (5) Remove Duplicates (after deduping a helper column first), (6) Go To Special → Blanks to fill or delete gaps, (7) Text to Columns → Finish to convert numbers-as-text to real numbers. Wrap the whole thing in an Excel Table so new rows inherit the cleaning steps.

What are the most common data quality issues?

Extra leading/trailing spaces (the #1 cause of VLOOKUP #N/A errors), inconsistent capitalisation, duplicate records that differ only by case or whitespace, mixed date formats (DD/MM vs MM/DD), numbers stored as text (SUM returns 0), blank rows splitting a range so pivots stop at the gap, merged cells breaking formulas, and inconsistent category names ("NY" vs "New York" vs "new york "). TRIM + PROPER + Remove Duplicates fix roughly 80% of what you will meet in the wild.

Why does my VLOOKUP return #N/A when I can see the value?

99% of the time it is an invisible space or case-mismatch. Run =LEN(A2) on both the lookup value and the target — if the lengths differ, one side has hidden whitespace. Wrap both in TRIM and paste as values. For case-mismatches, VLOOKUP is case-insensitive by default, so case alone rarely causes #N/A — but it does cause phantom duplicates downstream.

Should I use Flash Fill or formulas for data cleaning?

Flash Fill for one-off cleans on static data — it is faster, requires no formula knowledge, and handles fuzzy patterns gracefully. Formulas (or Power Query) for any workflow that refreshes, because Flash Fill is a one-shot transformation and does not extend to new rows. Rule of thumb: if the data will never change, Flash Fill; if the data will be updated monthly, Power Query.

When should I graduate from formulas to Power Query?

When you catch yourself repeating the same cleaning steps each month, when your workbook is over 50MB, when TRIM + SUBSTITUTE chains become unreadable, or when the data source is multiple files/folders. Power Query records your cleaning steps once and replays them on refresh — a 10-step manual cleaning becomes a one-click Refresh All. Start with the Power Query guide when you hit any of those thresholds.

Sources & Further Reading

Related tutorials