WordPress and php’s Multibyte String

Last August I was going crazy trying to figure out why WordPress was dropping most of my posts.

The problem happened whenever I used an apostrophe or a quote (‘ or “). Whenever I saved a draft of the post, everything after the first apostrophe or quotation mark would disappear into oblivion. This happened even though my database collation and my wp-config file both specified the utf-8 character set.

So, for example,

Here’s a little post. Doomed to be chopped off at the first sign of an apostrophe. Alas!

would become:

Here

At the time, I had narrowed the problem down to php’s multibyte string extension. I run my own server, so I could rebuild php without the mbstring extension and everything worked fine. When I rebuilt it again with mbstring back in the mix, the problem reappeared.

And yet…

Disabling mbstring has a few disadvantages.

  • For one thing, it’s one of the php extensions required for WordPress operation. Some very popular plugins (like Jetpack) use it. In fact, without mbstring enabled, Jetpack gives you a warning that certain things, like Twitter widgets, might not work correctly.
  • For another thing, disabling the extension limits WordPress’s language capacity. I might not care that I can’t blog in German or Japanese. But then again, someday I might. And disabling mbstring is obviously not going to work for anyone else who regularly blogs in languages other than English.
  • For a third thing, it’s really hard to disable mbstring if you don’t run your own server. Those settings are loaded when php starts. And, while you can find some advice about disabling them after the fact using your .htaccess file, the solution is less than ideal.

Besides, for a geek like me, the question remained. How can mbstring be a required extension, and yet millions of WordPress sites run just fine on servers all over the place where mbstring is enabled by default? There had to be something else.

And indeed there is a better way!

I’ve been digging a little deeper into the mbstring extension.

There are several settings that come into play, any of which can cause the death of post-apostrophe posts in WordPress. The good news is you can change these settings in your php.ini file.

For those of us who are blogging in languages represented by utf-8 characters, here are the settings you’ll need to check in php.ini. If you’re using shared hosting where you don’t have access to the main php.ini file, you can create your own with a text editor, put these lines in it, and place it in your root directory:

mbstring.language = Neutral

Tells the extension to use generic (utf-8) encoding.

mbstring.internal_encoding = "UTF-8"

Tells the extension to use utf-8 for internal processes.

mbstring.http_input = auto

Tells the extension to assume that anything going in (i.e. your post submission when you hit “save”) will be in the language you told it to use in the first setting above (mbstring.language).

mbstring.http_output = "UTF-8"

Tells the extension to spit everything back out in utf-8 encoding after processing.
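If you want to confirm which values PHP actually picked up (handy on shared hosting, where a custom php.ini may or may not be honored), a small script along these lines could help. This is only a minimal sketch: the helper name `mbstring_report` is made up for illustration and is not part of PHP or WordPress. It relies only on the built-in `ini_get()` and `extension_loaded()` functions.

```php
<?php
// Hypothetical helper (not part of PHP or WordPress): collects the current
// values of the given php.ini directives so they can be inspected.
function mbstring_report(array $names): array
{
    $report = [];
    foreach ($names as $name) {
        $value = ini_get($name);
        // ini_get() returns false when the directive is unknown,
        // for example because the mbstring extension is not loaded.
        $report[$name] = ($value === false) ? '(not available)' : $value;
    }
    return $report;
}

// The four directives discussed in this post.
$directives = [
    'mbstring.language',
    'mbstring.internal_encoding',
    'mbstring.http_input',
    'mbstring.http_output',
];

echo 'mbstring loaded: ', extension_loaded('mbstring') ? 'yes' : 'no', PHP_EOL;
foreach (mbstring_report($directives) as $name => $value) {
    echo $name, ' = ', var_export($value, true), PHP_EOL;
}
```

Run it from the command line or drop it on the server under a temporary name, load it once, and delete it afterwards. If mbstring.language still reports something other than neutral, your php.ini changes probably aren’t being read. Note that on PHP 5.6 and later the three encoding directives are deprecated in favor of default_charset, so empty values there are not necessarily a sign of trouble.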

There are a few other settings you might want to play with. And if you’re blogging in Japanese you might want to check out the settings for that. You can find advice about all this on PHP’s Mbstring Runtime Configuration page.

Now, with any luck, you can have your mbstring and your apostrophes too!

2 thoughts on “WordPress and php’s Multibyte String”

  1. I believe the ‘right’ answer is to convert your db/tables/columns, where applicable, from UTF8 to UTF8MB4. By doing so, all those special characters that appear in Windows CP1252, but which are polluting the ISO-8859-1/UTF8 datasets, can be saved without truncating. It gets us that much closer to handling the full universe of characters.

    I’ve been running it for some time, after having a project consume waay too much time in diagnosing what was at fault for the crash and burn. In our case it was single right apostrophes, though most of those 80-90 range characters, which were reserved for control codes (and which Microsoft instead used for ‘extending’ the character set), would do the same.

    Personally, I’m old school, preferring a single ‘ and double ” to be usable on either side and look neat doing it.

    • Thanks for this, Robb.

      No question, the UTF8MB4 is able to deal with a ton more characters. In a Microsoft environment your suggestion might indeed be the right ticket.

      I did try this at one point, but in my case, it didn’t work.

      I generally use a Mac, so I’m pretty sure it wasn’t a Microsoft encoding thing for me. Moreover, I could use the plain ' (ASCII 39) when writing a post, save it, and have it come back as ‘ (‘). Then on a subsequent save, the new character would cause the rest of the post to fall into the void. The database encoding matched the wp-config.php setting. The missing link was having (God knows why) the default php settings for mbstring set to Japanese. At least in my case, setting mbstring all back to neutral was what did the trick.

      Again, thanks for your suggestion. Hope it helps someone reading this!
      Cheers!
