Read PDF and Word DOC Files Using PHP
One of my customers has an insane amount of PDF and Microsoft Word DOC files on their website. It’s core to their online services so it’s not as though they’re garbage files up on the server. My customer wanted their website’s search engine (Sphider) to read these PDF files and DOC files so that their clients could get at the documents they needed without going through a bunch of summary pages to get them. I was successful in the task, so let me show you how to read PDF and DOC files using PHP.
Reading PDF Files
To read PDF files, you will need to install the XPDF package, which includes "pdftotext". Once XPDF/pdftotext is installed, run the following PHP statement to get the PDF text:
$content = shell_exec('/usr/local/bin/pdftotext ' . escapeshellarg($filename) . ' -'); // the trailing dash sends the text to stdout; escapeshellarg() guards against unsafe file names
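If you also want to know when the conversion fails, a small wrapper around exec() can check the exit status. This is only a sketch: the function name pdf_to_text() and the default binary path are assumptions for illustration, not part of the original setup.

```php
<?php
// Hypothetical helper (not from the original post): runs pdftotext and
// returns the extracted text, or throws if the tool exits with an error.
function pdf_to_text(string $filename, string $binary = '/usr/local/bin/pdftotext'): string
{
    // escapeshellarg() quotes each part so spaces and shell metacharacters are safe.
    $command = escapeshellarg($binary) . ' ' . escapeshellarg($filename) . ' -';

    // exec() fills $lines with the command's stdout and $status with its exit code.
    $lines = [];
    $status = 0;
    exec($command, $lines, $status);

    if ($status !== 0) {
        throw new RuntimeException("Text extraction exited with status $status for $filename");
    }

    return implode("\n", $lines);
}
```

Calling pdf_to_text($filename) then either returns the text or raises an exception that the search indexer can log and skip.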
Reading DOC Files
Like the PDF example above, you’ll need to download another package. This package is called Antiword. Here’s the code to grab the Word DOC content:
$content = shell_exec('/usr/local/bin/antiword ' . escapeshellarg($filename)); // escapeshellarg() guards against unsafe file names
The above code does NOT read DOCX files and does not (and purposely so) preserve formatting. There are other libraries that will preserve formatting but in our case, we just want to get at the text.
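Since a search indexer usually meets both file types, the two commands can sit behind one function that picks the tool by file extension. The sketch below assumes the same binary paths as the examples above; build_extract_command() and read_document() are hypothetical names chosen for illustration.

```php
<?php
// Hypothetical dispatcher: chooses the command-line tool by file extension.
// The binary paths mirror the examples above; adjust them for your server.
function build_extract_command(string $filename): ?string
{
    $extension = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    $file = escapeshellarg($filename);

    switch ($extension) {
        case 'pdf':
            // The trailing dash makes pdftotext write to stdout.
            return '/usr/local/bin/pdftotext ' . $file . ' -';
        case 'doc':
            return '/usr/local/bin/antiword ' . $file;
        default:
            // Unsupported type, e.g. DOCX, which antiword does not read.
            return null;
    }
}

// Returns the extracted text, or null when the type is unsupported
// or the command produced no output.
function read_document(string $filename): ?string
{
    $command = build_extract_command($filename);
    if ($command === null) {
        return null;
    }
    $output = shell_exec($command);
    return is_string($output) ? $output : null;
}
```

Keeping the command building separate from the execution makes it easy to log or inspect the exact command when a file fails to convert.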
A special thank you to Jeremy Parrish for his help and insight with this task.
Discussion
Cool. I wonder if there is any solution that doesn’t need extra software, i.e. a simple PHP Class or something. Maybe a future project? Anyway, this solution is pretty elegant and works well if you have full access to the server your page is on.
@Simon Sigurdhsson: I don’t know of a pure-PHP way of doing this. I know that if you’re on Windows/IIS, you can use the COM library. Other than that, these methods are the only ones I know of.
I’ve wanted to read DOC files before but came out with nothing that would work with my hosting since it’s shared hosting and there is no allowance for shell_exec.
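On shared hosting it can help to test up front whether shell_exec is usable before relying on it. The check below is a sketch, not part of the original post, and shell_exec_available() is a hypothetical name; it looks at both the function table and the disable_functions setting, since older PHP versions still report disabled functions as existing.

```php
<?php
// Hypothetical check: reports whether shell_exec() can be called on this host.
function shell_exec_available(): bool
{
    // Newer PHP versions remove disabled functions from the function table.
    if (!function_exists('shell_exec')) {
        return false;
    }

    // Older PHP versions keep the function but list it in disable_functions.
    $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));

    return !in_array('shell_exec', $disabled, true);
}
```

If this returns false, the command-line approach in this post will not work on that host without a change to the PHP configuration.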
I tried this approach a long time ago, but it didn’t work for all PDF versions. Have you tested it with all PDF versions?
Looks good. Shame this requires extra software though. It would be good if you could maybe parse the document into an image.
Is anyone able to provide a link to a 32 bit binary of the Linux version of antiword? I don’t have shell access to the server I work on.
I have a project that requires searching through .pdf and .doc files, so this will definitely come in handy! Thanks, I hope it works, or comes close…
Anyone please help me out.
I’m currently working on a project in which I have to read PDF, DOC or DOCX files using PHP code on localhost. Is that possible on localhost? If so, please send me the code. Thank you,
Koushik Ghosh
That is cool, thanks for your post. You would like to post more shell tutorial. 😀 cong nguyen http://www.neoob.com/
check this out – pdf to txt in pure php – http://community.livejournal.com/php/295413.html … back in 2005. leet. 😐 That should ease some future stress.. I hope that script is still functional, as I haven’t had a chance to try myself.
I need help with reading Bangla text from a DOC file in PHP. If anyone knows how, please send me the code. It is very urgent.
How would you do this on a web hosted space, assuming that you have no access to services other than those set up for the package you buy, meaning that you can not install stuff on the remote computer?
Anyone please help me out.
I’m currently working on a project in which I have to read and write DOC or DOCX files using PHP and XML code on localhost. Is that possible on localhost? If so, please send me the code or a URL. Thank you,
Mindaugas
Thanks for your help and for the instructions on how to read data from PDF files.
I followed the instructions you gave.
It works fine on localhost, which runs on Windows.
Now I want to run it on my web server.
Can you tell me what changes I have to make, and where?
Thanks in advance.
@Thiru: Thiru sir, please tell me how to store the content of a Word document in a database. Please reply to me by email.
antiword works like a charm from shell but i only get a slightly fucked up first line in my var if i run it from php, can u tell me what i’m doing wrong? 🙂
I got the solution :
$content = shell_exec('/usr/bin/antiword -f -w 0 formatting.doc');
I had forgotten about the -w argument. With -w 0 you get the whole line, or you can set -w to the line width you need, such as 30 or 40.
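The same call can take the width and the file name as parameters. This is a sketch only: antiword_command() is a hypothetical helper, the binary path is taken from the comment above, and the -f flag is left out to keep the output plain.

```php
<?php
// Hypothetical helper: builds an antiword command with an explicit line width.
// A width of 0 asks antiword not to wrap lines, as described above.
function antiword_command(string $filename, int $width = 0, string $binary = '/usr/bin/antiword'): string
{
    // Negative widths make no sense, so clamp them to 0.
    $width = max(0, $width);

    // escapeshellarg() keeps file names with spaces or quotes safe.
    return $binary . ' -w ' . $width . ' ' . escapeshellarg($filename);
}

// Usage: $content = shell_exec(antiword_command('formatting.doc'));
```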
$content = shell_exec('pdftotext ' . escapeshellarg('sumario.pdf') . ' -'); echo $content;
I wonder if anyone has actually gotten this to work. I’ve tried, but it doesn’t work for me. Dyo, did you actually get it to work? What did you do differently, or what kind of web server are you running? I’ve tried it on Ubuntu 10.04, Apache2 with PHP5. Can you share your code here? Here’s mine:
$filename = '/var/www/myfiles/mydoc.doc';
$content = shell_exec('/usr/local/bin/antiword ' . $filename);
echo $content;
Hello, I tried your tutorial on localhost, but nothing happens. Can you give me a solution?
Dude…. this is epic. I visited this page 2 days ago trying to develop a search engine for PDFs and had no clue what this meant. Now it makes sense to me and I’m going to use this. Thanks!
Hello David, I used your XPDF and Antiword examples. XPDF works well with my PHP, but Antiword does not run unless the antiword folder is installed in the c:/antiword/bin directory. I don’t want to run it from the c:/ drive; I would like to run it from PHP in my htdocs directory, but that does not work. How can I do this? Can anyone help? Example:
CODE FOR XPDF (working):
$page_content = shell_exec('C:/xampp/htdocs/search-includes/xpdfbin-win-3.03/bin32/pdftotext ' . $filename . ' -');
CODE FOR ANTIWORD (not working):
$page_content = shell_exec("C:/xampp/htdocs/search-includes/antiword/bin/antiword " . $filename);
CODE FOR ANTIWORD (working):
$page_content = shell_exec("C:/antiword/bin/antiword " . $filename);
@Randy – search engine for PDFs. Hi Randy, if you have root access to your server, you could try Apache Tomcat with Apache Solr, and you will get the same result for PDF, Word, and some other formats. It should take only a little time to check which formats are supported. Kind regards,
Nick
Hello, I’d really like to use those two packages, but I don’t really know how to install them. (I do have SSH access to my Apache server but don’t know how to install this kind of package.)
Could you help me? I searched a lot on the web but did not find an adequate solution, and I don’t want to make mistakes and cause trouble on my system.
It works well , i am creating a mobile handler that will open PDF files even in mobile phones without downloading it actually. I tested the code by installing XPDF and open files like this Thanks again.
I tried this code
error_reporting(0);
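$file = file_get_contents($_GET['url']);
$file_name = rand(100000, 100000000);
file_put_contents($file_name, $file);
$c = shell_exec('pdftotext ' . $file_name . ' -'); // the trailing dash makes pdftotext print to stdout instead of writing a .txt file
header('Content-Type: text/plain');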
echo $c;
unlink($file_name);
How can I install Antiword and XPDF on my VPS? It runs Red Hat.
Any help is appreciated…
Hi, I want to know how to install those packages on Linux. Can anyone please tell me the steps to install them?
Extracting Text from a PDF File in PHP
Sometimes you need to extract text from a PDF file with PHP. Below is an example script that solves this problem.
Install the required library:
composer require smalot/pdfparser
// load the Composer autoloader
include 'vendor/autoload.php';
// create the PDF parser object
$parser = new \Smalot\PdfParser\Parser();
// parse the PDF file
$pdf = $parser->parseFile('technic_report.pdf');
// print the text from the file
print $pdf->getText();
Note that the text you get from the PDF file will not keep the document’s original formatting. That matters little, however, when the goal is to pull the data you are interested in out of the text.
If the PDF file has several pages, you can go through each page separately:
// extract all pages from the PDF file
$pages = $pdf->getPages();
// go through each page and print its text
foreach ($pages as $page) {
    echo $page->getText();
}
And this is how you can get the PDF file’s metadata:
// extract the metadata from the PDF file
$details = $pdf->getDetails();
// go through each value
foreach ($details as $property => $value) {
    // join array values into a single string
    if (is_array($value)) {
        $value = implode(', ', $value);
    }
    echo $property . ' => ' . $value . "\n";
}
This is how easily you can, for example, automate the processing of a large number of PDF files in PHP, extracting the data you need from them.
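A batch run over a folder could look roughly like the sketch below. The function extract_directory() is a hypothetical helper, not part of smalot/pdfparser; it takes the extraction step as a callable so that one unreadable file does not stop the whole run.

```php
<?php
// Hypothetical batch helper: applies $extract to every .pdf file in $dir
// and returns an array of path => extracted text (null when extraction failed).
function extract_directory(string $dir, callable $extract): array
{
    $results = [];

    // glob() can return false on error, so fall back to an empty list.
    foreach (glob(rtrim($dir, '/') . '/*.pdf') ?: [] as $path) {
        try {
            $results[$path] = $extract($path);
        } catch (\Exception $e) {
            // Record the failure and continue with the next file.
            $results[$path] = null;
        }
    }

    return $results;
}

// Usage with the parser object created earlier (requires smalot/pdfparser):
// $texts = extract_directory('/path/to/pdfs', function ($path) use ($parser) {
//     return $parser->parseFile($path)->getText();
// });
```

The directory path in the usage comment is a placeholder; files that fail to parse show up with a null value so they can be reviewed afterwards.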
Created 14.05.2019 08:56:04
Copying of materials is permitted only with attribution to the author (Mikhail Rusakov) and an indexable direct link to the site (http://myrusakov.ru)!
Copyright © 2010-2023 Mikhail Yuryevich Rusakov. All rights reserved.