Last modified: 2014-11-20 09:25:46 UTC
Created attachment 10763 [details]
iconv-test.c

Our test IPTCTest::testIPTCParseForcedUTFButInvalid verifies that when image metadata is marked as UTF-8 but contains non-UTF-8 bytes, the bad bytes are dropped and the valid UTF-8 is kept. This was the behavior of iconv() in PHP < 5.4, as can be tested with:

var_dump( iconv("UTF-8", "UTF-8//IGNORE", "\xC3\xC3\xC3\xB8") );

The behavior of iconv(3) (with IGNORE) is to provide the good bytes *and* report the error. That can be tested with the attached program.

The fact that, when not using IGNORE, the good bytes were returned was reported as a bug in https://bugs.php.net/52211 and fixed in e3fdf3 by always returning an empty string.

So our parsing of IPTC data is now different (wrong?) on PHP 5.4.

We can:
- Set the empty string as the correct output (remove/change the test).
- Verify UTF-8 correctness ourselves (UtfNormal::cleanUp() seems the appropriate tool; we could then remove the UTF-8 replacement character if a silent skip is really desired).
- Request that PHP's iconv() behavior be changed back, or that a new flag be added.
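To make the "verify UTF-8 correctness ourselves" option concrete, here is a minimal C sketch of the silent-skip semantics the test expects: well-formed UTF-8 sequences are copied, any other byte is dropped. The names utf8_seq_len and utf8_drop_invalid are hypothetical. This is not UtfNormal::cleanUp() (which this comment describes as leaving a replacement character) and not the attached iconv-test.c; it only illustrates the intended output.

```c
#include <stddef.h>

/* Length of the well-formed UTF-8 sequence starting at s (n bytes
 * available), or 0 if s does not start one. Overlong forms, surrogates
 * and code points above U+10FFFF are rejected. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;
    unsigned char lo = 0x80, hi = 0xBF; /* allowed range of byte 2 */

    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                       /* ASCII */
    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        len = 2;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        len = 3;
        if (s[0] == 0xE0) lo = 0xA0;    /* no overlong 3-byte forms */
        if (s[0] == 0xED) hi = 0x9F;    /* no UTF-16 surrogates */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        len = 4;
        if (s[0] == 0xF0) lo = 0x90;    /* no overlong 4-byte forms */
        if (s[0] == 0xF4) hi = 0x8F;    /* nothing above U+10FFFF */
    } else {
        return 0;                       /* stray continuation or bad lead */
    }
    if (n < len || s[1] < lo || s[1] > hi)
        return 0;
    for (i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF)
            return 0;
    return len;
}

/* Copy in[0..n) to out, silently dropping every byte that is not part
 * of a well-formed sequence. out must have room for n bytes.
 * Returns the number of bytes written. */
size_t utf8_drop_invalid(const char *in, size_t n, char *out)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t i = 0, w = 0, k, len;

    while (i < n) {
        len = utf8_seq_len(s + i, n - i);
        if (len == 0) {                 /* bad byte: skip it and resync */
            i++;
            continue;
        }
        for (k = 0; k < len; k++)
            out[w++] = in[i + k];
        i += len;
    }
    return w;
}
```

Calling utf8_drop_invalid() on the four bytes from the var_dump() example drops the two leading \xC3 bytes and keeps the final \xC3\xB8 pair. That is the "bad bytes dropped, sane UTF-8 kept" result described above.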
*** Bug 67908 has been marked as a duplicate of this bug. ***
It looks to me like the real problem is described in <https://bugs.php.net/bug.php?id=48147> and the upstream-upstream bug at <https://sourceware.org/bugzilla/show_bug.cgi?id=13541>. Apparently glibc's iconv implementation deviates from the documented API of libiconv. Unfortunately the fix that was suggested to PHP to work around the glibc bug has not been implemented.
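For anyone who wants to observe the glibc behavior discussed here, the following is a rough C sketch of a probe. It is not the attached iconv-test.c. The struct ignore_result and the function convert_with_ignore are hypothetical names, and the "//IGNORE" suffix is an extension that not every iconv implementation accepts. The function records whether iconv() reported EILSEQ and how many bytes it still produced, so a caller can tell "error reported, output delivered anyway" apart from a clean conversion.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

struct ignore_result {
    int opened;      /* 1 if iconv_open() accepted "UTF-8//IGNORE" */
    int saw_eilseq;  /* 1 if any iconv() call failed with EILSEQ */
    size_t out_len;  /* bytes written to the output buffer */
};

/* Convert UTF-8 to UTF-8 with //IGNORE and report what happened. */
struct ignore_result convert_with_ignore(const char *in, size_t in_len,
                                         char *out, size_t out_cap)
{
    struct ignore_result r = {0, 0, 0};
    char *ip = (char *)in;  /* iconv() wants char **; input is not modified */
    char *op = out;
    size_t in_left = in_len, out_left = out_cap;
    iconv_t cd = iconv_open("UTF-8//IGNORE", "UTF-8");

    if (cd == (iconv_t)-1)
        return r;           /* suffix or charset not supported here */
    r.opened = 1;

    while (in_left > 0) {
        size_t before = in_left;
        size_t rc = iconv(cd, &ip, &in_left, &op, &out_left);

        if (rc != (size_t)-1)
            break;          /* converted without reporting an error */
        if (errno != EILSEQ)
            break;          /* E2BIG or EINVAL: stop, keep what we have */
        r.saw_eilseq = 1;
        if (in_left == before)
            break;          /* no progress, so bad input was not skipped */
    }

    r.out_len = out_cap - out_left;
    iconv_close(cd);
    return r;
}
```

If saw_eilseq is set while out_len is non-zero, the converter delivered output and still reported an error. That is the combination a caller such as PHP would have to treat as success when //IGNORE is requested.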
Change 172101 had a related patch set uploaded by BryanDavis: Avoid glibc iconv bug by using mb_convert_encoding https://gerrit.wikimedia.org/r/172101
*** Bug 73178 has been marked as a duplicate of this bug. ***