Added a count-words feature #447

Open
wants to merge 1 commit into
base: main

rutvishah22

PR Title

Add feature: Count words in extracted text

Summary

This PR adds a new feature to count the number of words in the extracted text.

Description

Users can now see the total word count for the text that is extracted from the PDF.
This helps provide more context and basic stats about the document.

Thank you,

Rutvi

@DhanushNehru DhanushNehru requested a review from Copilot July 21, 2025 15:26

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds a word counting feature to the PDF text extraction tool. The new functionality allows users to see the total word count for text extracted from PDF files, providing additional context and basic statistics about the document.

  • Adds a new count_words_in_pdf() function that extracts text and counts words using regex
  • Integrates word counting into the example usage section to demonstrate the feature
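The regex-based counting summarized above can be sketched on its own, separate from any PDF handling. This is a minimal illustration rather than the PR's code: `count_words` and `WORD_PATTERN` are hypothetical names, and only the pattern itself is taken from the diff under review. One behavior worth knowing is that `\b\w+\b` splits on apostrophes and hyphens, so a contraction counts as two tokens.

```python
import re

# Same pattern as in the PR: runs of word characters between word boundaries.
WORD_PATTERN = re.compile(r"\b\w+\b")


def count_words(text: str) -> int:
    """Count word tokens in already-extracted text (hypothetical helper)."""
    return len(WORD_PATTERN.findall(text))


if __name__ == "__main__":
    print(count_words("Hello, world!"))  # punctuation is not counted
    print(count_words("don't stop"))     # tokens: "don", "t", "stop"
    print(count_words(""))               # empty text yields no tokens
```

The `.lower()` call from the PR is omitted here because, for typical text, lowercasing should not change how many tokens the pattern finds.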

Comment on lines +70 to +84
    try:
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            text = ""
            for page_num in range(len(pdf_reader.pages)):
                page = pdf_reader.pages[page_num]
                text += page.extract_text()

        # Remove extra whitespaces and split into words
        words = re.findall(r'\b\w+\b', text.lower())
        return len(words)
    except FileNotFoundError:
        return "Error: PDF file not found."
    except Exception as e:
        return f"An error occurred: {e}"

Copilot AI Jul 21, 2025

Inconsistent indentation: the function uses 7 spaces instead of the standard 4 spaces used elsewhere in the file.

Suggested change
    try:
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            text = ""
            for page_num in range(len(pdf_reader.pages)):
                page = pdf_reader.pages[page_num]
                text += page.extract_text()
        # Remove extra whitespaces and split into words
        words = re.findall(r'\b\w+\b', text.lower())
        return len(words)
    except FileNotFoundError:
        return "Error: PDF file not found."
    except Exception as e:
        return f"An error occurred: {e}"

Copilot uses AI. Check for mistakes.

Comment on lines +71 to +80
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            text = ""
            for page_num in range(len(pdf_reader.pages)):
                page = pdf_reader.pages[page_num]
                text += page.extract_text()

        # Remove extra whitespaces and split into words
        words = re.findall(r'\b\w+\b', text.lower())
        return len(words)

Copilot AI Jul 21, 2025

The function duplicates text extraction logic already present in the existing extract_text_from_pdf function. Consider reusing the existing function to avoid code duplication.

Suggested change
-        with open(pdf_path, 'rb') as pdf_file:
-            pdf_reader = PyPDF2.PdfReader(pdf_file)
-            text = ""
-            for page_num in range(len(pdf_reader.pages)):
-                page = pdf_reader.pages[page_num]
-                text += page.extract_text()
-        # Remove extra whitespaces and split into words
-        words = re.findall(r'\b\w+\b', text.lower())
-        return len(words)
+        # Reuse extract_text_from_pdf to get the text
+        text = extract_text_from_pdf(pdf_path)
+        if text is None:
+            return "Error: Could not extract text from the PDF."
+        # Remove extra whitespaces and split into words
+        words = re.findall(r'\b\w+\b', text.lower())
+        return len(words)

Comment on lines +82 to +84
        return "Error: PDF file not found."
    except Exception as e:
        return f"An error occurred: {e}"

Copilot AI Jul 21, 2025

The function returns different types (integer for success, string for errors). This inconsistent return type makes error handling difficult. Consider raising exceptions or returning a consistent data structure.
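One way to get a consistent return type is to always return an `int` on success and raise on failure. The sketch below is an illustration under stated assumptions, not the project's code: `WordCountError`, `count_words_strict`, and the injected `extract_text` callable are hypothetical names chosen so the example runs without a PDF library. It re-raises `FileNotFoundError` unchanged and wraps other failures with `raise ... from`, which keeps the original exception available as `__cause__`.

```python
import re
from typing import Callable


class WordCountError(Exception):
    """Hypothetical error type for failures while counting words."""


def count_words_strict(pdf_path: str, extract_text: Callable[[str], str]) -> int:
    """Return the word count as an int, or raise; never return an error string."""
    try:
        text = extract_text(pdf_path)
    except FileNotFoundError:
        # Let the caller see the original, specific exception.
        raise
    except Exception as exc:
        # Wrap anything else, keeping the original exception as __cause__.
        raise WordCountError(f"Could not process {pdf_path}") from exc
    return len(re.findall(r"\b\w+\b", text))


if __name__ == "__main__":
    # A stand-in extractor (assumption) so the sketch runs without a real PDF.
    print(count_words_strict("example.pdf", lambda path: "one two three"))
```

Callers can then treat the result as a number without checking its type first, and handle failures in an ordinary `try`/`except`.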

Suggested change
-        return "Error: PDF file not found."
-    except Exception as e:
-        return f"An error occurred: {e}"
+        raise FileNotFoundError(f"PDF file not found: {pdf_path}")
+    except Exception as e:
+        raise RuntimeError(f"An error occurred while processing the PDF: {e}")

                text += page.extract_text()

        # Remove extra whitespaces and split into words
        words = re.findall(r'\b\w+\b', text.lower())

Copilot AI Jul 21, 2025

The re module is used but not imported. This will cause a NameError at runtime.
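A related detail: in the diff as quoted, the `re.findall` call sits inside the `try` block, so the `NameError` would most likely be caught by `except Exception` and come back as an error string instead of stopping the program. The sketch below illustrates that, assuming a module that never imports `re`; `count_words_missing_import` is a hypothetical stand-in that copies only the try/except shape from the PR.

```python
def count_words_missing_import(text):
    """Hypothetical stand-in with the PR's try/except shape; `re` is never imported."""
    try:
        # `re` is undefined in this module, so this line raises NameError...
        return len(re.findall(r"\b\w+\b", text))
    except Exception as e:
        # ...which this handler catches, because NameError subclasses Exception.
        return f"An error occurred: {e}"


result = count_words_missing_import("hello world")
print(type(result).__name__, "-", result)
```

Adding `import re` at the top of the module resolves the error; the sketch only shows why the missing import could go unnoticed when the broad handler is in place.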
