tabula-py can now extract remote PDFs and multiple tables at once
tabula-py is a Python library that lets you extract tables from PDFs into pandas DataFrames. Today, I released v0.8.0. In this post, I will introduce the improvements made since my previous post about tabula-py. If you aren't familiar with tabula-py, see the previous post.
Change Notes
- Able to read a remote PDF by passing a URL
- [Experimental] Add multiple_tables mode
- Add batch conversion method: convert_into_by_batch()
- Add encoding option
- Add java_options
- Will deprecate read_pdf_table() method
I will explain the important features.
Read a remote PDF by passing a URL
If you want to extract a DataFrame from a PDF on the internet, you can pass its URL directly, without downloading the file manually.
read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/12s0324.pdf")
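To show what passing a URL saves you, here is a minimal sketch of the manual step: check whether the input looks like a URL and, if so, download it to a temporary file first. The helper names is_url and localize are hypothetical, and this is an illustration under those assumptions, not necessarily how tabula-py does it internally.

```python
import shutil
import tempfile
from urllib.parse import urlparse
from urllib.request import urlopen


def is_url(path):
    """Return True if path looks like a remote http(s) or ftp URL."""
    return urlparse(path).scheme in ("http", "https", "ftp")


def localize(path):
    """Return a local file path, downloading the file first if path is a URL.

    Hypothetical helper for illustration only.
    """
    if not is_url(path):
        return path  # already a local file, nothing to do
    # Copy the remote PDF into a temporary file and return that file's name.
    with urlopen(path) as response, tempfile.NamedTemporaryFile(
        suffix=".pdf", delete=False
    ) as tmp:
        shutil.copyfileobj(response, tmp)
        return tmp.name
```

With a helper like this, a local path is returned unchanged, while a URL is replaced by the path of the downloaded copy. The caller is responsible for deleting that temporary file afterwards.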
[Experimental] Add multiple_tables mode
Because tabula-py is a simple wrapper of tabula-java, it was hard to handle multiple tables in a page. Now you can extract multiple tables in a page using the multiple_tables option.
read_pdf('tests/resources/data.pdf', pages=2, multiple_tables=True)
In this mode, the function creates a list of DataFrames from tabula-java's JSON output, so if tabula-java's JSON format changes, the output could break. If you see a CParserError, try setting the multiple_tables option.
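The idea of building one table per JSON entry can be sketched as follows. The JSON layout used here (a list of tables, each with a "data" list of rows whose cells carry a "text" field) is an assumption for illustration, the sample data is made up, and tables_from_json is a hypothetical helper, not part of tabula-py.

```python
import json


def tables_from_json(raw_json):
    """Convert tabula-java-style JSON into a list of tables.

    Each table is a list of rows, and each row is a list of cell strings.
    The JSON layout handled here is an assumption for illustration.
    """
    tables = []
    for table in json.loads(raw_json):
        rows = [[cell["text"] for cell in row] for row in table["data"]]
        tables.append(rows)
    return tables


# Two small tables, as they might appear on a single page (made-up data).
sample = json.dumps([
    {"data": [[{"text": "a"}, {"text": "b"}], [{"text": "1"}, {"text": "2"}]]},
    {"data": [[{"text": "x"}], [{"text": "9"}]]},
])

tables = tables_from_json(sample)
print(len(tables))  # 2
print(tables[0])    # [['a', 'b'], ['1', '2']]
```

Each list of rows could then be passed to pandas.DataFrame to get one DataFrame per table. Keeping the tables separate means they do not need to share the same number of columns.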
Add batch conversion method: convert_into_by_batch()
Since tabula-java v0.9.2, tables can be extracted from PDFs in batch. You can use this feature through the convert_into_by_batch() method.
convert_into_by_batch(path_to_dir, output_format='csv')
Pass the path of a directory containing PDFs, not the path of a specific PDF.
tabula-py writes the extracted tables to the same directory as the input files.
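To preview what a batch run over a directory would touch, a small sketch like the one below can help. planned_outputs is a hypothetical helper, and the naming rule it applies (same file name with the extension replaced by the output format) is an assumption about the batch output, not something confirmed here.

```python
import tempfile
from pathlib import Path


def planned_outputs(input_dir, output_format="csv"):
    """Map each PDF in input_dir to the output file expected next to it.

    Hypothetical helper; the naming rule (same stem, new extension)
    is an assumption.
    """
    directory = Path(input_dir)
    if not directory.is_dir():
        raise ValueError(f"{input_dir!r} is not a directory")
    return {
        pdf: pdf.with_suffix("." + output_format)
        for pdf in sorted(directory.glob("*.pdf"))
    }


# Demonstrate with a throwaway directory holding empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.pdf", "notes.txt"):
        (Path(tmp) / name).touch()
    for src, dst in planned_outputs(tmp).items():
        print(src.name, "->", dst.name)
```

The directory check mirrors the requirement above: passing a single PDF path raises an error instead of silently doing nothing.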
TODOs
There are several problems that may be fixed after the release of tabula-java 0.9.3, e.g. handling embedded fonts, including Japanese…
Waiting for your collaboration!
If you have any trouble with tabula-py, please file an issue on GitHub. I would rather not receive emails, because the answers are not shared with other people. Please make sure to fill in the issue template; it greatly reduces the effort I need to solve the problem.