Technology - Lesson 3ii - Python - VAT Return PDFs to VAT Summary
Introduction
This lesson covers how to convert VAT return PDF files into a summarized VAT report using Python. In this tutorial, we will use several Python libraries to read, process, and summarize data from VAT return PDFs, outputting it into an Excel file for further analysis. This process is valuable for finance teams who need to streamline VAT reporting without manual data entry.
We will use Google Colab to run this code, which is convenient for this type of project because no local installation is needed. However, remember that Google Colab provides a temporary environment, so libraries need to be reinstalled each session.
Libraries Needed
camelot-py: For extracting tables from PDFs.PyPDF2: To handle PDF file reading and manipulation.pandas: For data manipulation and cleaning.
To install these libraries in Google Colab, use the following commands:
!pip install "camelot-py[cv]" -q
!apt-get install ghostscript -q
!pip install "PyPDF2~=2.0" -q
Uploading VAT PDF Files
Upload the VAT return PDF files to your Colab environment. It’s helpful to name the files with a prefix such as 01, 02, 03, etc., so they are ordered by month in descending order. Store the files in a folder outside of "Sample Data" for easy access.
Code Explanation
Below is the code for processing VAT PDF files and generating a summary. This code is designed to be run in Google Colab, and each section is explained in detail.
1. Install Required Libraries
!pip install "camelot-py[cv]" -q
!apt-get install ghostscript -q
!pip install "PyPDF2~=2.0" -q
2. Import Libraries
After installing, import the necessary libraries:
import camelot as cam
import glob
import pandas as pd
3. Define Categories for VAT Summary
Define the categories that will be used in the VAT summary. These are the headers for different types of transactions found in VAT reports:
cat = ['Particulars',
'Sales',
'Taxable Sales',
'Export',
'Non-Taxable-sales',
'Purchase/Import',
'Taxable Purchase',
'Taxable Import',
'Non-Taxable Purchase',
'Non-Taxable Import' ,
'Others',
'Other Adjustment',
'Total',
'Debit-Credit',
'Previous Month VAT Credit',
'Net VAT Payable',
'VAT Refund Claim',
'VAT Refund Claim Basis']
4. Upload and Process VAT PDFs
Process each VAT PDF file, extract data, and adjust values as needed:
# Process each PDF file
df = pd.DataFrame(cat)
for filenames in sorted(glob.glob('*.pdf')):
print(filenames)
table = cam.read_pdf(filenames, pages = '1', flavor = 'lattice', encoding = 'utf-8')
tab = table[0].df
# Adjust extracted data
try:
tab.iloc[6,-2] = tab.iloc[6,3]
except:
tab.iloc[6,-2] = 0
try:
tab.iloc[7,-2] = tab.iloc[7,3]
except:
tab.iloc[7,-2] = 0
# Additional adjustments and reading for credit table
try:
credit_table = cam.read_pdf(filenames, pages = '1', flavor = 'stream', encoding = 'utf-8')[0].df
credit_amount = credit_table.iloc[15,4]
except:
credit_amount = ""
vat_claim = credit_table.iloc[15,3]
tab.loc[14,2] = credit_amount
tab.loc[17,2] = vat_claim
df['{} Gross Amount'.format(filenames[:7])],df['{} Tax'.format(filenames[:7])] = [tab[2],tab.iloc[:,-2]]
5. Clean and Structure Data
Clean and format the extracted data, then separate it into gross amount and VAT amount dataframes:
# Clean and structure data
df = df.replace('^0 .*',0, regex=True)
df = df.replace('_vat_return_work','')
df = df.fillna(0)
df = df.rename(columns={0: 'Particulars'})
df1 = df.iloc[1:].T
# Separate gross and VAT data
df_gross = df1.filter(like="Gross Amount",axis=0)
df_vat = df1.filter(like="Tax",axis=0)
# Clean data
df_gross.columns = cat[1:]
df_gross = df_gross.add_suffix('_amount')
df_vat.columns = cat[1:]
df_vat = df_vat.add_suffix('_vat')
6. Combine Data and Save to Excel
Combine gross and VAT data, clean it further, and save the final summary to an Excel file:
# Combine and save
result = pd.concat([df_gross,df_vat], axis=1)
result = result.reindex(sorted(result.columns), axis=1).fillna(0)
cols = result.columns
result[cols] = result[cols].apply(pd.to_numeric, errors='coerce')
result.to_excel('{}_{}.xlsx'.format(output_name, fiscal_year))
print("Complete !!! Download the saved file")
7. Clean Up PDF Files
Optionally, delete the PDF files after processing:
import os
for file in glob.glob("*.pdf"):
os.remove(file)
Conclusion
This tutorial demonstrated a process to convert VAT return PDFs into a summarized Excel report using Python and Google Colab. The provided code allows for the extraction, cleaning, and structuring of VAT data, making it easier to analyze and report on monthly VAT activities.
In upcoming lessons, we’ll expand on this process with additional finance-related automations and data analyses.