Garbled Subtitles? Fix Broken Subtitle Encoding (UTF-8, EUC-KR, Shift-JIS & Mojibake)

By Picute Team · Published on June 25, 2026 · 8 min read

subtitles · encoding · utf-8 · mojibake · srt · tutorial

TL;DR

Who this is for: Anyone who opens an SRT and sees garbled text instead of words, especially Korean, Japanese, or Chinese subtitles pulled from older sources, and accented European text.
The problem: The timestamps look fine but the actual text is gibberish (ì•ˆë…•, cafÃ©) or question marks. It looks like a corrupt file, but it usually isn't: the bytes are being decoded with the wrong character map.
Bottom line: Subtitles are bytes; a character encoding maps bytes to characters. Open a file saved in one encoding (UTF-8, or a legacy codepage like EUC-KR/CP949, Shift-JIS, Big5, Windows-1252) while assuming another, and you get mojibake. If you see readable-but-wrong characters, the bytes are intact: re-open with the right encoding and re-save as UTF-8 (no BOM). If you see ? or � already baked into the file, those bytes are lost: re-export from the source or transcribe fresh.

You open a subtitle file and the timestamps are perfect, but the actual dialogue reads like ì•ˆë…• or cafÃ© or a row of ???. It looks like the file is corrupt. It almost never is — the text is all there, it's just being decoded with the wrong character map. This guide explains why that happens, how to tell a fixable problem from a genuinely lost one, and exactly how to get clean text back.

What "encoding" actually means

A subtitle file is just bytes on disk. A character encoding is the lookup table that turns those bytes into letters you can read. The same byte can mean completely different things depending on which table you use — so if a file was written with one encoding and read back with another, every non-English character comes out wrong. That mismatch is the entire problem; the file itself is fine.
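To make the lookup-table idea concrete, here is a minimal sketch in Python. Python and its built-in codec names (`euc_kr`, `cp1252`, `utf-8`) are an assumption of this example, not something the tools in this guide require. It takes the four bytes an old Korean file would use for "안녕" and reads them through three different tables.

```python
# The same four bytes, read through three different lookup tables.
raw = "안녕".encode("euc_kr")  # how a legacy Korean file stores the word

as_korean = raw.decode("euc_kr")                 # the table the file was written with
as_western = raw.decode("cp1252")                # Windows-1252: each byte becomes a Latin symbol
as_utf8 = raw.decode("utf-8", errors="replace")  # invalid as UTF-8, so U+FFFD marks appear

print(raw.hex())   # the bytes themselves never change
print(as_korean)   # 안녕
print(as_western)  # ¾È³ç
print(as_utf8)     # contains the replacement character U+FFFD
```

Nothing about `raw` changes between the three reads; only the table used to interpret it does.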

There are two families of encoding you'll meet:

UTF-8: the modern, universal encoding. It can represent every language on earth, and it's what every current player, editor, and platform expects. ASCII characters (plain English letters, digits, punctuation) take one byte each; every other character takes two to four.
Legacy codepages: older, region-specific tables, each covering one language group:
- Korean: EUC-KR and CP949 (also called UHC / Windows-949, a superset of EUC-KR)
- Japanese: Shift-JIS (CP932)
- Simplified Chinese: GB2312 / GBK (CP936) / GB18030 (a backward-compatible superset, not the same as CP936)
- Traditional Chinese: Big5 (CP950)
- Western European: Windows-1252 / ISO-8859-1 (Latin-1)
- Cyrillic: Windows-1251

Mojibake — the Japanese word for this exact phenomenon, now the standard term — happens whenever a file written in one of these is read as another.

The 30-second diagnosis: what your garbage characters mean

The kind of garbage tells you whether the text is recoverable.

What you see	What it means	Recoverable?
Readable-but-wrong characters: `ì•ˆë…•`, `cafÃ©`, `Ð¿Ñ€Ð¸`	Mojibake — right bytes, wrong decoder	Yes — re-decode with the correct encoding
`?` or `�` (replacement diamonds) written into the file	A previous save re-encoded lossily — the original bytes are gone	Usually no — re-export from source
Empty boxes / "tofu": `□ □ □`	A missing font, not an encoding problem	N/A — install/embed a font with those glyphs

The first row is by far the most common, and it's completely fixable. The trick is that mojibake runs in two opposite directions, and the look of the garble tells you which — and therefore the fix.

UTF-8 read as a Western codepage → accented-Latin soup. The file is already UTF-8; a player or editor is decoding it as Windows-1252:

A UTF-8 Korean file shows ì•ˆë…• (each Korean character's several UTF-8 bytes get drawn as separate Latin symbols) — the real Korean text is one correct re-read away.
A UTF-8 café shows cafÃ© — the classic Ã© standing in for é, because é is the two UTF-8 bytes C3 A9 (C3 → Ã, A9 → ©).

The bytes are already right, so the fix is simply to make the tool read the file as UTF-8.
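The Latin-soup direction is easy to reproduce, which also shows why it is reversible. The sketch below is in Python; `garble` and `ungarble` are hypothetical helper names made up for this illustration. The first simulates a tool that reads UTF-8 bytes as Windows-1252, and the second undoes it.

```python
def garble(text: str) -> str:
    """Simulate a tool that reads UTF-8 bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def ungarble(soup: str) -> str:
    """Reverse it: turn the symbols back into bytes, then decode those as UTF-8."""
    return soup.encode("cp1252").decode("utf-8")


print(garble("café"))     # cafÃ©
print(garble("안녕"))      # ì•ˆë…•
print(ungarble("cafÃ©"))  # café
```

`ungarble` is only needed when the soup has already been saved or copied as text; if the file on disk is still plain UTF-8, pointing the viewer at UTF-8 is enough. Note that Python's `cp1252` codec leaves five byte values undefined, so `garble` can raise `UnicodeDecodeError` for some characters.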

A legacy codepage read as UTF-8 → the reverse, and the usual cause of broken Korean/Japanese/Chinese subtitles: an old EUC-KR/CP949, Shift-JIS, or Big5 file opened by a UTF-8 tool, which shows � marks or scrambled glyphs. Here you re-open declaring that legacy codepage, then save as UTF-8.
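If you'd rather not judge by eye, the raw bytes can tell you which of the two directions you are in. This is a heuristic sketch in Python, and `which_direction` is a hypothetical name used only here. It relies on the fact that text in a legacy codepage rarely happens to be valid UTF-8, although a very short legacy string can pass by accident.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def which_direction(data: bytes) -> str:
    """Rough guess at which of the two mojibake cases a file belongs to."""
    if data.isascii():
        return "plain ASCII: nothing to fix"
    if looks_like_utf8(data):
        return "already UTF-8: make the viewer read it as UTF-8"
    return "legacy codepage: re-open with that codepage, then save as UTF-8"


print(which_direction("안녕".encode("utf-8")))
print(which_direction("안녕".encode("cp949")))
```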

Crucially, the timestamps stay clean in every case, because the bytes an SRT uses for its structure — the digits, the : and , in timestamps, the --> arrow, the line breaks — map to the same characters in UTF-8 and in these legacy codepages. Clean timing + garbled words = an encoding problem, full stop.
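You can check the timing claim directly. The Python sketch below (codec names are Python's, chosen for this example) encodes one SRT cue header in UTF-8 and in several legacy codepages and compares the resulting bytes.

```python
# An SRT cue number and timing line use only ASCII characters.
timing = "1\n00:00:01,000 --> 00:00:04,000\n"

encodings = ["utf-8", "cp949", "shift_jis", "gbk", "big5", "cp1252", "cp1251"]
encoded = {enc: timing.encode(enc) for enc in encodings}

# One distinct value means every encoding produced identical bytes.
all_same = len(set(encoded.values())) == 1
print(all_same)  # True
```

This holds for the ASCII-compatible encodings discussed here. UTF-16 is a different story, since it stores even digits as two bytes, but it is rare for SRT files.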

How to fix recoverable mojibake

The fix is always the same shape: re-open the file declaring its real encoding (UTF-8 or the legacy codepage, per the two cases above), confirm the text reads correctly, then save it as UTF-8. Pick whichever tool you have, and try encodings until the preview is right.
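The "try encodings until the preview is right" loop can be sketched in a few lines of Python. The candidate list and the name `preview_decodings` are assumptions made for this example; adjust the list to the languages you expect.

```python
CANDIDATES = ["utf-8", "cp949", "shift_jis", "gb18030", "big5", "cp1252", "cp1251"]


def preview_decodings(data: bytes, candidates=CANDIDATES, width: int = 40) -> dict:
    """Return a short preview for every candidate that decodes without errors."""
    previews = {}
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are impossible in this encoding, so rule it out
        previews[enc] = text[:width].replace("\n", " / ")
    return previews


sample = "안녕하세요".encode("cp949")
for enc, preview in preview_decodings(sample).items():
    print(f"{enc:10} {preview}")
```

A clean decode only rules an encoding in as possible. Single-byte codepages such as Windows-1252 and Windows-1251 accept almost any bytes, so the final call is still yours: pick the preview that reads as real words.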

Subtitle Encoding Converter (free, in your browser)

The fastest option needs nothing installed: open the Subtitle Encoding Converter, upload the garbled file, and step through the source encodings — EUC-KR/CP949, Shift-JIS, GB18030/GBK, Big5, Windows-1252/1251 — until the live preview reads correctly, then download clean UTF-8 (no BOM). It runs entirely in your browser, so the file never leaves your device. If you'd rather use a desktop editor, any of these do the same job:

Subtitle Edit (free, Windows)

Opens files with auto-detection. If it guessed wrong, use the encoding dropdown (top of the window) to switch to the correct codepage until the preview reads correctly, then File → Save As and choose UTF-8. The most reliable point-and-click option.

Notepad++ (free, Windows)

1. Open the file. Encoding → Character sets → pick the region whose codepage makes the text readable (e.g. Korean → EUC-KR for an old CP949 file) to reinterpret the existing bytes. If instead the file is really UTF-8 shown as Latin soup, choose Encoding → UTF-8 to reinterpret.
2. Confirm the text now reads correctly.
3. Encoding → Convert to UTF-8 (not "UTF-8-BOM").
4. Save.

VLC (any platform)

Preferences → Subtitles / OSD → Default encoding → set it to the file's real codepage (e.g. Korean (EUC-KR/CP949)). Note this fixes the display while playing — it does not rewrite the file, so other apps will still see the original bytes.

iconv (command line, macOS/Linux)

iconv -f EUC-KR -t UTF-8 broken.srt > fixed.srt

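Swap EUC-KR for SHIFT-JIS, GBK, BIG5, WINDOWS-1252, etc. as needed; iconv -l lists the encoding names your build accepts. If a Korean file stops partway with an "illegal input sequence" error, try CP949 as the source instead, since it covers characters that strict EUC-KR does not.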
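If iconv isn't available (on Windows, for example), the same conversion is a few lines of Python. This is a sketch under assumptions: `convert_to_utf8` is a hypothetical helper name, and the file names in the usage line are placeholders.

```python
from pathlib import Path


def convert_to_utf8(src: str, dst: str, source_encoding: str) -> None:
    """Rough equivalent of: iconv -f <source_encoding> -t UTF-8 src > dst"""
    raw = Path(src).read_bytes()
    # Strict decoding: raises UnicodeDecodeError if the guessed encoding is wrong,
    # rather than silently writing replacement characters into the output.
    text = raw.decode(source_encoding)
    # Python's plain "utf-8" codec never writes a BOM.
    Path(dst).write_bytes(text.encode("utf-8"))
```

Usage would look like `convert_to_utf8("broken.srt", "fixed.srt", "cp949")`. Working in bytes on both ends keeps the file's original line endings untouched.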

Whatever you use, save as UTF-8 without a BOM (see below). Once the file is clean UTF-8, the rest of the toolkit works on it normally — convert it to VTT, pull a plain-text transcript, or fix its timing if that's also off.

Open the free SRT → text extractor: once your file is clean UTF-8, strip the timecodes for a copy-ready transcript in your browser.

The BOM trap

When you save UTF-8, some editors prepend a Byte Order Mark — three invisible bytes (EF BB BF) at the very start of the file. UTF-8 doesn't need one, and for SRT it actively causes trouble: the BOM sits immediately before the first cue's number, so some parsers fail to recognize cue 1, or you see a stray ï»¿ before the first subtitle (those three BOM bytes drawn as Western characters).

Always choose "UTF-8" rather than "UTF-8 with BOM" / "UTF-8-BOM" when your editor offers both. In Notepad++ that's Encoding → Convert to UTF-8 (the BOM-less option).
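If a file already has a BOM, removing it means dropping exactly those three leading bytes. A minimal Python sketch, with `strip_utf8_bom` as a hypothetical helper name:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if present; otherwise return the bytes unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + b"1\n00:00:01,000 --> 00:00:04,000\nHello\n"

print(with_bom[:3].decode("cp1252"))  # ï»¿  (the BOM drawn as Western characters)
print(strip_utf8_bom(with_bom)[:1])   # b'1' (the cue number is now the first byte)
```

When reading text rather than bytes, Python's `utf-8-sig` codec does the same job on decode: it skips a BOM if one is there and behaves like plain UTF-8 if not.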

When the text is genuinely gone

If the file already contains literal ? or � characters — not as a display artifact, but written into the bytes — then somewhere upstream a tool re-saved the file in an encoding that couldn't represent those characters, and replaced each one with a placeholder. That conversion is lossy and one-way: the original characters no longer exist in this file, and no amount of re-decoding brings them back.
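Here is what that one-way conversion looks like in practice. The Python sketch below reproduces two common ways placeholders get baked in; the sample string is made up for illustration.

```python
original = "안녕, café"

# Case 1: a tool saves the text as Windows-1252, which has no Korean characters.
lossy = original.encode("cp1252", errors="replace")
print(lossy)  # b'??, caf\xe9'

# Case 2: a tool misreads CP949 bytes as UTF-8, then saves what it displayed.
baked = "안녕".encode("cp949").decode("utf-8", errors="replace").encode("utf-8")
print(b"\xef\xbf\xbd" in baked)  # True: U+FFFD is now literally in the file
```

In the first case both Korean characters have collapsed into the same byte, `?`, so nothing in the file records which characters they were. In the second, the invalid bytes were swapped for the replacement character before saving, and the saved file no longer contains them.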

Your options at that point:

Re-download / re-export the subtitles from the original source, this time saving as UTF-8.
Transcribe the video fresh, which sidesteps the broken file entirely.

A note for Korean (and CJK) subtitles

This problem is most common with Korean, Japanese, and Chinese subtitle files, because those languages have widely-used legacy codepages that predate UTF-8 (EUC-KR/CP949, Shift-JIS, GBK, Big5). A Korean SRT downloaded from an older site or ripped from a DVD is very often CP949, and opening it in a UTF-8-default editor or player shows � marks or scrambled glyphs instead of the Korean text. Reinterpret it as CP949 (or EUC-KR), save as UTF-8, and it's fixed for good — and now portable to every modern player.
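Because CP949 is a superset of EUC-KR, declaring CP949 is usually the safer choice when you are unsure which of the two a Korean file uses. The Python sketch below illustrates the difference using CPython's built-in `euc_kr` and `cp949` codecs; `needs_cp949` is a hypothetical helper name, and other tools may draw the line between the two slightly differently.

```python
def decodes(data: bytes, encoding: str) -> bool:
    """True if the bytes decode without errors in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def needs_cp949(data: bytes) -> bool:
    """True when the bytes are valid CP949 but not valid strict EUC-KR."""
    return decodes(data, "cp949") and not decodes(data, "euc_kr")


# "똠" is one of the Hangul syllables that exists in CP949 but not in the EUC-KR set.
print(needs_cp949("똠방각하".encode("cp949")))
print(needs_cp949("안녕하세요".encode("cp949")))
```

The first call reports that only CP949 works; the second text uses common syllables that both encodings store with the same bytes.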

Avoid encoding problems entirely: generate fresh

Most mojibake headaches trace back to the same root: a subtitle file that was saved in some legacy codepage, or a tool that still assumes one. The clean way out is to not start from a legacy-encoded file at all.

When you generate subtitles from the audio, the output is UTF-8 from the first byte — there's no regional codepage anywhere in the chain to mismatch, so garbled text simply never appears. Picute transcribes your video (or a pasted YouTube URL) and gives you a clean UTF-8 SRT, plus a plain-text transcript and an optional burned-in video, in 70+ languages.

Use the re-decode steps above to rescue a file you already have; generate fresh when you'd rather never see ? again.

Frequently asked questions

My Korean subtitles show things like ì•ˆë…• or ??? — can I get the real text back?

It depends on which garble you see — the look tells you the fix. Readable-but-wrong Latin soup like ì•ˆë…• or cafÃ© is mojibake with the underlying bytes fully intact; they're just being decoded with the wrong map. ì•ˆë…• specifically means the file is already UTF-8 and your player or editor is reading it as a Western codepage — so the fix is to point that tool at UTF-8 (the bytes are right; only the viewer is wrong). The opposite, and the more common 'broken Korean subtitles' case, is an older file saved in EUC-KR/CP949 (or a Japanese file in Shift-JIS) opened by a UTF-8 tool, which shows � or scrambled glyphs — there you re-open declaring that legacy codepage, then save as UTF-8. Either way, the text returns the moment the encoding matches. Plain question marks (?) or replacement diamonds (�) that stay even after you try the right encoding are different: a previous tool already re-encoded the file and threw the original bytes away, so they can't be recovered here — you'll need to re-download or re-export from the original source, or generate the subtitles again.

What encoding should I save SRT files as?

UTF-8 without a BOM. UTF-8 is the universal modern encoding — every current player, editor, and platform expects it, and it can represent every language, so you never hit a codepage mismatch again. The 'without BOM' part matters for SRT specifically: a BOM is three invisible bytes (EF BB BF) some editors prepend to UTF-8 files, and because they sit immediately before the first cue number, some parsers fail to recognize cue 1 or render a stray ï»¿ (the BOM bytes shown as Western text). Save plain UTF-8 and the problem disappears.

Why do the timestamps and numbers look fine but only the words are broken?

Because the digits, colons, commas, and --> arrows that make up an SRT's structure map to the same bytes in UTF-8 and in these legacy codepages. So the structure of the file survives any encoding mismatch untouched — it's only the non-ASCII characters (Korean, Japanese, accented Latin, Cyrillic) that get mangled. That's actually a useful diagnostic: if the timing lines are clean and only the dialogue is gibberish, you're looking at an encoding problem, not a corrupt file.

Subtitles look fine in VLC on my computer but turn to garbage on my phone or TV — why?

Different players assume different default encodings for subtitle files that don't declare one. VLC on desktop lets you set a 'Default encoding' and may be guessing your file's legacy codepage correctly, while a TV or phone app assumes UTF-8 and chokes on the same bytes. The fix isn't to configure every device — it's to convert the file itself to UTF-8 once, so every player reads it correctly regardless of its default. A file saved as clean UTF-8 is the portable answer.

How do I stop this from happening in the first place?

Keep every subtitle file in UTF-8. Whenever a subtitle tool offers an encoding choice on export, pick UTF-8 (without BOM). Avoid re-saving legacy-codepage files through tools that default to a regional encoding. And when you generate subtitles from the audio rather than reusing a downloaded file, the output is UTF-8 from the start — there's no legacy codepage in the chain to mismatch, so mojibake never appears.