Technology - Lesson 3ii - Python - VAT Return PDFs to VAT Summary

Python VAT PDF to Excel

Introduction

This lesson covers how to convert VAT return PDF files into a summarized VAT report using Python. In this tutorial, we will use several Python libraries to read, process, and summarize data from VAT return PDFs, outputting it into an Excel file for further analysis. This process is valuable for finance teams who need to streamline VAT reporting without manual data entry.

We will use Google Colab to run this code, which is convenient for this type of project because no local installation is needed. However, remember that Google Colab provides a temporary environment, so libraries need to be reinstalled each session.

Libraries Needed

  • camelot-py: For extracting tables from PDFs.
  • PyPDF2: To handle PDF file reading and manipulation.
  • pandas: For data manipulation and cleaning.

To install these libraries in Google Colab, use the following commands:

!pip install "camelot-py[cv]" -q
!apt-get install ghostscript -q
!pip install "PyPDF2~=2.0" -q

Uploading VAT PDF Files

Upload the VAT return PDF files to your Colab environment. It’s helpful to name the files with a prefix such as 01, 02, 03, etc., so they are ordered by month in descending order. Store the files in a folder outside of "Sample Data" for easy access.

Code Explanation

Below is the code for processing VAT PDF files and generating a summary. This code is designed to be run in Google Colab, and each section is explained in detail.

1. Install Required Libraries

!pip install "camelot-py[cv]" -q
!apt-get install ghostscript -q
!pip install "PyPDF2~=2.0" -q

2. Import Libraries

After installing, import the necessary libraries:

import camelot as cam
import glob
import pandas as pd

3. Define Categories for VAT Summary

Define the categories that will be used in the VAT summary. These are the headers for different types of transactions found in VAT reports:

cat = ['Particulars',
       'Sales',
       'Taxable Sales',
       'Export',
       'Non-Taxable-sales',
       'Purchase/Import',
       'Taxable Purchase',
       'Taxable Import',
       'Non-Taxable Purchase',
       'Non-Taxable Import' ,
       'Others',
       'Other Adjustment',
       'Total',
       'Debit-Credit',
       'Previous Month VAT Credit',
       'Net VAT Payable',
       'VAT Refund Claim',
       'VAT Refund Claim Basis']

4. Upload and Process VAT PDFs

Process each VAT PDF file, extract data, and adjust values as needed:

# Process each PDF file
df = pd.DataFrame(cat)
for filenames in sorted(glob.glob('*.pdf')):
  print(filenames)
  table = cam.read_pdf(filenames, pages = '1', flavor = 'lattice', encoding = 'utf-8')
  tab = table[0].df
  # Adjust extracted data
  try:
    tab.iloc[6,-2] = tab.iloc[6,3]
  except:
    tab.iloc[6,-2] = 0
  try:
    tab.iloc[7,-2] = tab.iloc[7,3]
  except:
    tab.iloc[7,-2] = 0
  # Additional adjustments and reading for credit table
  try:
    credit_table = cam.read_pdf(filenames, pages = '1', flavor = 'stream', encoding = 'utf-8')[0].df
    credit_amount = credit_table.iloc[15,4]
  except:
    credit_amount = ""
  vat_claim = credit_table.iloc[15,3]
  tab.loc[14,2] = credit_amount
  tab.loc[17,2] = vat_claim
  df['{} Gross Amount'.format(filenames[:7])],df['{} Tax'.format(filenames[:7])] = [tab[2],tab.iloc[:,-2]]

5. Clean and Structure Data

Clean and format the extracted data, then separate it into gross amount and VAT amount dataframes:

# Clean and structure data
df = df.replace('^0 .*',0, regex=True)
df = df.replace('_vat_return_work','')
df =  df.fillna(0)
df = df.rename(columns={0: 'Particulars'})
df1 = df.iloc[1:].T

# Separate gross and VAT data
df_gross = df1.filter(like="Gross Amount",axis=0)
df_vat = df1.filter(like="Tax",axis=0)

# Clean data
df_gross.columns =  cat[1:]
df_gross = df_gross.add_suffix('_amount')
df_vat.columns =  cat[1:]
df_vat = df_vat.add_suffix('_vat')

6. Combine Data and Save to Excel

Combine gross and VAT data, clean it further, and save the final summary to an Excel file:

# Combine and save
result = pd.concat([df_gross,df_vat], axis=1)
result = result.reindex(sorted(result.columns), axis=1).fillna(0)
cols = result.columns
result[cols] = result[cols].apply(pd.to_numeric, errors='coerce')

result.to_excel('{}_{}.xlsx'.format(output_name, fiscal_year))
print("Complete !!! Download the saved file")

7. Clean Up PDF Files

Optionally, delete the PDF files after processing:

import os
for file in glob.glob("*.pdf"):
  os.remove(file)

Conclusion

This tutorial demonstrated a process to convert VAT return PDFs into a summarized Excel report using Python and Google Colab. The provided code allows for the extraction, cleaning, and structuring of VAT data, making it easier to analyze and report on monthly VAT activities.

In upcoming lessons, we’ll expand on this process with additional finance-related automations and data analyses.