Added a count-words feature #447
Pull Request Overview
This PR adds a word counting feature to the PDF text extraction tool. The new functionality allows users to see the total word count for text extracted from PDF files, providing additional context and basic statistics about the document.
- Adds a new count_words_in_pdf() function that extracts text and counts words using regex
- Integrates word counting into the example usage section to demonstrate the feature
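As a rough illustration of the regex-based counting described above, the sketch below applies the same pattern, \b\w+\b, to a plain string. The helper name count_words is hypothetical and not part of the PR. Note that \w+ stops at apostrophes and hyphens, so "It's" is counted as two words.

```python
import re


def count_words(text: str) -> int:
    """Count runs of word characters (letters, digits, underscore)."""
    # Same pattern as in the PR. The PR also lowercases the text first;
    # for ASCII input that does not affect the number of matches.
    return len(re.findall(r"\b\w+\b", text))


sample = "Hello, world! It's 2024."
# Matches: Hello, world, It, s, 2024
print(count_words(sample))
```

Whether contractions and hyphenated terms should count as one word or two is a design choice the PR does not address.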
    try:
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            text = ""
            for page_num in range(len(pdf_reader.pages)):
                page = pdf_reader.pages[page_num]
                text += page.extract_text()

        # Remove extra whitespaces and split into words
        words = re.findall(r'\b\w+\b', text.lower())
        return len(words)
    except FileNotFoundError:
        return "Error: PDF file not found."
    except Exception as e:
        return f"An error occurred: {e}"
Inconsistent indentation: the function uses 7 spaces instead of the standard 4 spaces used elsewhere in the file.
Suggested change:
    try:
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            text = ""
            for page_num in range(len(pdf_reader.pages)):
                page = pdf_reader.pages[page_num]
                text += page.extract_text()
        # Remove extra whitespaces and split into words
        words = re.findall(r'\b\w+\b', text.lower())
        return len(words)
    except FileNotFoundError:
        return "Error: PDF file not found."
    except Exception as e:
        return f"An error occurred: {e}"
Copilot uses AI. Check for mistakes.
    with open(pdf_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        text = ""
        for page_num in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[page_num]
            text += page.extract_text()

    # Remove extra whitespaces and split into words
    words = re.findall(r'\b\w+\b', text.lower())
    return len(words)
The function duplicates text extraction logic already present in the existing extract_text_from_pdf function. Consider reusing the existing function to avoid code duplication.
Suggested change:
    # Reuse extract_text_from_pdf to get the text
    text = extract_text_from_pdf(pdf_path)
    if text is None:
        return "Error: Could not extract text from the PDF."
    # Remove extra whitespaces and split into words
    words = re.findall(r'\b\w+\b', text.lower())
    return len(words)
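A minimal sketch of the reuse suggested above. The body of extract_text_from_pdf is not shown in this PR, so the stand-in below is hypothetical: it is assumed to return the extracted text as a string, or None on failure, which is what the suggested `if text is None` check implies. Unlike the suggestion, this sketch returns None rather than an error string on failure, so callers never receive a str where they expect a count.

```python
import re
from typing import Optional


def extract_text_from_pdf(pdf_path: str) -> Optional[str]:
    """Hypothetical stand-in for the project's existing function.

    Assumed contract: returns the extracted text, or None on failure.
    """
    fake_documents = {"sample.pdf": "Two pages of text.\nFour more words here."}
    return fake_documents.get(pdf_path)


def count_words_in_pdf(pdf_path: str) -> Optional[int]:
    """Count words by delegating extraction to extract_text_from_pdf."""
    text = extract_text_from_pdf(pdf_path)
    if text is None:
        return None  # extraction failed; no count is available
    return len(re.findall(r"\b\w+\b", text))


print(count_words_in_pdf("sample.pdf"))
print(count_words_in_pdf("missing.pdf"))
```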
        return "Error: PDF file not found."
    except Exception as e:
        return f"An error occurred: {e}"
The function returns different types (integer for success, string for errors). This inconsistent return type makes error handling difficult. Consider raising exceptions or returning a consistent data structure.
Suggested change:
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")
    except Exception as e:
        raise RuntimeError(f"An error occurred while processing the PDF: {e}")
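One way to get a consistent return type along the lines suggested above is to have the function either return an int or raise. The sketch below is an illustration under stated assumptions, not the PR's code: the extract_text parameter is a hypothetical hook standing in for the PyPDF2-based extraction so that the example runs without a real PDF, and `raise ... from exc` is an addition that preserves the original traceback.

```python
import re
from pathlib import Path
from typing import Callable


def count_words_in_pdf(pdf_path: str, extract_text: Callable[[str], str]) -> int:
    """Return the word count, or raise; never return an error string."""
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")
    try:
        # extract_text is a hypothetical hook for the real extraction step
        text = extract_text(pdf_path)
    except Exception as exc:
        raise RuntimeError(
            f"An error occurred while processing the PDF: {exc}"
        ) from exc
    return len(re.findall(r"\b\w+\b", text))


# Callers handle failures explicitly instead of inspecting the return type.
try:
    count_words_in_pdf("no_such_file_for_this_sketch.pdf", lambda path: "")
except FileNotFoundError as err:
    print(err)
```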
            text += page.extract_text()

    # Remove extra whitespaces and split into words
    words = re.findall(r'\b\w+\b', text.lower())
The re module is used but not imported. This will cause a NameError at runtime.
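The failure described above only surfaces when the function is called, because Python resolves a global name such as re at call time rather than at definition time. A small sketch, using a hypothetical count_words helper, shows the error and the one-line fix:

```python
def count_words(text):
    # `re` is looked up when this line runs, not when the function is defined
    return len(re.findall(r"\b\w+\b", text))


try:
    count_words("one two")
except NameError as exc:
    error_message = str(exc)  # the message names the missing `re`

import re  # the fix; normally placed at the top of the module

fixed_count = count_words("one two")
print(error_message)
print(fixed_count)
```

Because the definition itself succeeds, a missing import like this can slip past a quick manual check that never exercises the new function.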
PR Title
Add feature: Count words in extracted text
Summary
This PR adds a new feature to count the number of words in the extracted text.
Description
Users can now see the total word count for the text that is extracted from the PDF.
This helps provide more context and basic stats about the document.
Thank you,
Rutvi