Last modified: 2012-05-07 21:22:55 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from the display of bug reports and their history, links might be broken. See T34064, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 32064 - Text extraction is not encoded in utf-8


Summary:	Text extraction is not encoded in utf-8

Status:	RESOLVED DUPLICATE of bug 35122

Product:	MediaWiki extensions
Classification:	Unclassified
Component:	PdfHandler
Version:	unspecified
Hardware:	All All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2011-10-30 23:33 UTC by Philippe Elie
Modified:	2012-05-07 21:22 UTC
CC List:	2 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Philippe Elie 2011-10-30 23:33:36 UTC

extensions/PdfHandler.image.php fails to extract PDF text as UTF-8; see http://fr.wikisource.org/w/index.php?title=Page:Journal_des_d%C3%A9bats,_7_d%C3%A9cembre_1820.pdf/1&action=edit&redlink=1 for an example. retrieveMetaData() forces UTF-8 output encoding, but only for the metadata; this is not done for the text itself. For some reason, it looks like the pdftotext installed on the cluster doesn't use UTF-8 as its default output encoding. (Note that a PDF has no internal text encoding, as text is encoded as draw commands using the currently selected font.)
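
To illustrate the kind of fix described above, here is a minimal sketch in Python (not the PdfHandler PHP code) that requests UTF-8 output from pdftotext explicitly through its -enc option instead of relying on the installed build's default. The function names build_pdftotext_command and extract_pdf_text are hypothetical and exist only for this example; the sketch also assumes a pdftotext binary is available on the PATH.

```python
import subprocess


def build_pdftotext_command(pdf_path, binary="pdftotext"):
    """Return the argument list for a pdftotext call with explicit UTF-8 output.

    Hypothetical helper for illustration; it is not part of PdfHandler.
    """
    # "-enc UTF-8" sets the text output encoding explicitly, so the result
    # does not depend on the default of the installed pdftotext build.
    # The trailing "-" tells pdftotext to write the text to stdout.
    return [binary, "-enc", "UTF-8", pdf_path, "-"]


def extract_pdf_text(pdf_path, binary="pdftotext"):
    """Run pdftotext and return the extracted text as a Python string."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, binary),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext exits with an error
    )
    # The bytes are UTF-8 because the encoding was requested explicitly above.
    return result.stdout.decode("utf-8")
```

Keeping the command construction separate from the call makes it easy to check that the encoding flag is always passed, for both text and metadata extraction paths.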

Comment 1 Beau 2012-05-07 21:22:55 UTC

Bug 35122 has more details.

*** This bug has been marked as a duplicate of bug 35122 ***
