Added a count-words feature #447

rutvishah22 · 2025-06-07T08:06:24Z

PR Title

Add feature: Count words in extracted text

Summary

This PR adds a new feature to count the number of words in the extracted text.

Description

Users can now see the total word count for the text that is extracted from the PDF.
This helps provide more context and basic stats about the document.

Thank you,

Rutvi

Copilot

Pull Request Overview

This PR adds a word counting feature to the PDF text extraction tool. The new functionality allows users to see the total word count for text extracted from PDF files, providing additional context and basic statistics about the document.

Adds a new count_words_in_pdf() function that extracts text and counts words using a regex.
Integrates word counting into the example usage section to demonstrate the feature.

Copilot · 2025-07-21T15:26:58Z

PDF to text/script.py

+       try:
+           with open(pdf_path, 'rb') as pdf_file:
+               pdf_reader = PyPDF2.PdfReader(pdf_file)
+               text = ""
+               for page_num in range(len(pdf_reader.pages)):
+                   page = pdf_reader.pages[page_num]
+                   text += page.extract_text()
+
+               # Remove extra whitespaces and split into words
+               words = re.findall(r'\b\w+\b', text.lower())
+               return len(words)
+       except FileNotFoundError:
+           return "Error: PDF file not found."
+       except Exception as e:
+           return f"An error occurred: {e}"


Inconsistent indentation: the function uses 7 spaces instead of the standard 4 spaces used elsewhere in the file.

Suggested change

-       try:
-           with open(pdf_path, 'rb') as pdf_file:
-               pdf_reader = PyPDF2.PdfReader(pdf_file)
-               text = ""
-               for page_num in range(len(pdf_reader.pages)):
-                   page = pdf_reader.pages[page_num]
-                   text += page.extract_text()
-
-               # Remove extra whitespaces and split into words
-               words = re.findall(r'\b\w+\b', text.lower())
-               return len(words)
-       except FileNotFoundError:
-           return "Error: PDF file not found."
-       except Exception as e:
-           return f"An error occurred: {e}"
+    try:
+        with open(pdf_path, 'rb') as pdf_file:
+            pdf_reader = PyPDF2.PdfReader(pdf_file)
+            text = ""
+            for page_num in range(len(pdf_reader.pages)):
+                page = pdf_reader.pages[page_num]
+                text += page.extract_text()
+
+            # Remove extra whitespaces and split into words
+            words = re.findall(r'\b\w+\b', text.lower())
+            return len(words)
+    except FileNotFoundError:
+        return "Error: PDF file not found."
+    except Exception as e:
+        return f"An error occurred: {e}"

Copilot · 2025-07-21T15:26:59Z

PDF to text/script.py

+           with open(pdf_path, 'rb') as pdf_file:
+               pdf_reader = PyPDF2.PdfReader(pdf_file)
+               text = ""
+               for page_num in range(len(pdf_reader.pages)):
+                   page = pdf_reader.pages[page_num]
+                   text += page.extract_text()
+
+               # Remove extra whitespaces and split into words
+               words = re.findall(r'\b\w+\b', text.lower())
+               return len(words)


The function duplicates text extraction logic already present in the existing extract_text_from_pdf function. Consider reusing the existing function to avoid code duplication.

Suggested change

-           with open(pdf_path, 'rb') as pdf_file:
-               pdf_reader = PyPDF2.PdfReader(pdf_file)
-               text = ""
-               for page_num in range(len(pdf_reader.pages)):
-                   page = pdf_reader.pages[page_num]
-                   text += page.extract_text()
-
-               # Remove extra whitespaces and split into words
-               words = re.findall(r'\b\w+\b', text.lower())
-               return len(words)
+           # Reuse extract_text_from_pdf to get the text
+           text = extract_text_from_pdf(pdf_path)
+           if text is None:
+               return "Error: Could not extract text from the PDF."
+
+           # Remove extra whitespaces and split into words
+           words = re.findall(r'\b\w+\b', text.lower())
+           return len(words)

Copilot · 2025-07-21T15:26:59Z

PDF to text/script.py

+           return "Error: PDF file not found."
+       except Exception as e:
+           return f"An error occurred: {e}"


The function returns different types (integer for success, string for errors). This inconsistent return type makes error handling difficult. Consider raising exceptions or returning a consistent data structure.

Suggested change

-           return "Error: PDF file not found."
-       except Exception as e:
-           return f"An error occurred: {e}"
+           raise FileNotFoundError(f"PDF file not found: {pdf_path}")
+       except Exception as e:
+           raise RuntimeError(f"An error occurred while processing the PDF: {e}")
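
For illustration, here is a minimal sketch of how the function might look if it always returned an int and signalled failures only by raising. This is not the PR's code: the `extract_text` parameter is a hypothetical stand-in for whatever extraction routine the script uses (for example the existing extract_text_from_pdf), passed in so the sketch runs without PyPDF2.

```python
import os
import re
from typing import Callable, Optional


def count_words_in_pdf(
    pdf_path: str,
    extract_text: Callable[[str], Optional[str]],  # hypothetical: injected extractor
) -> int:
    """Return the word count as an int; raise on any failure."""
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")
    try:
        text = extract_text(pdf_path)
    except Exception as e:
        # Wrap extraction errors, keeping the original as __cause__
        raise RuntimeError(
            f"An error occurred while processing the PDF: {e}"
        ) from e
    # Treat a None result from the extractor as empty text
    return len(re.findall(r'\b\w+\b', text or ""))
```

With this shape, a caller can wrap the call in a single try/except and never has to check whether the result is a number or an error string.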

Copilot · 2025-07-21T15:26:59Z

PDF to text/script.py

+                   text += page.extract_text()
+
+               # Remove extra whitespaces and split into words
+               words = re.findall(r'\b\w+\b', text.lower())


The re module is used but not imported. This will cause a NameError at runtime.
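
As a small sketch of the fix, the counting step works once `re` is imported at the top of the file. The helper name `count_words` below is hypothetical and only isolates the regex from the PDF handling.

```python
import re  # required for re.findall below


def count_words(text: str) -> int:
    """Count runs of word characters, using the same pattern as the PR."""
    return len(re.findall(r'\b\w+\b', text.lower()))


print(count_words("Hello, world! Hello."))  # 3
print(count_words("don't stop"))            # 3: the apostrophe splits "don" and "t"
```

Note that `\w+` treats an apostrophe as a word boundary, so contractions count as two words with this pattern.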

Added a count-words feature

6dc8318

DhanushNehru requested a review from Copilot July 21, 2025 15:26

Copilot AI reviewed Jul 21, 2025
