Added a count-words feature #447

Open
wants to merge 1 commit into
base: main
20 changes: 20 additions & 0 deletions PDF to text/script.py
@@ -65,6 +65,24 @@ def convert_pdf_to_txt(pdf_path, save_to_file=True, output_folder="output_texts")
     except Exception as e:
         print(f"Error processing {pdf_path}: {e}")


+def count_words_in_pdf(pdf_path):
+       try:
+              with open(pdf_path, 'rb') as pdf_file:
+                     pdf_reader = PyPDF2.PdfReader(pdf_file)
+                     text = ""
+                     for page_num in range(len(pdf_reader.pages)):
+                            page = pdf_reader.pages[page_num]
+                            text += page.extract_text()
+
+                     # Remove extra whitespaces and split into words
+                     words = re.findall(r'\b\w+\b', text.lower())
Copilot AI Jul 21, 2025

The re module is used but not imported. This will cause a NameError at runtime.

Copilot uses AI. Check for mistakes.
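
For illustration, the tokenizing step from the diff can be run on a plain string once `re` is imported. This is only a minimal sketch: `count_words` is a hypothetical helper name that does not appear in the PR, and it takes already-extracted text so that it needs no PDF library.

```python
import re  # the import the diff is missing


def count_words(text):
    """Count word tokens in text using the same pattern as the diff."""
    # \b\w+\b matches maximal runs of letters, digits, and underscores,
    # so punctuation and whitespace act as separators.
    return len(re.findall(r"\b\w+\b", text.lower()))


if __name__ == "__main__":
    print(count_words("Hello, world! Hello again."))
```

Note that this pattern splits on apostrophes, so a contraction such as "don't" is counted as two tokens.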

+                     return len(words)
Comment on lines +71 to +80
Copilot AI Jul 21, 2025

The function duplicates text extraction logic already present in the existing extract_text_from_pdf function. Consider reusing the existing function to avoid code duplication.

Suggested change
-              with open(pdf_path, 'rb') as pdf_file:
-                     pdf_reader = PyPDF2.PdfReader(pdf_file)
-                     text = ""
-                     for page_num in range(len(pdf_reader.pages)):
-                            page = pdf_reader.pages[page_num]
-                            text += page.extract_text()
-                     # Remove extra whitespaces and split into words
-                     words = re.findall(r'\b\w+\b', text.lower())
-                     return len(words)
+              # Reuse extract_text_from_pdf to get the text
+              text = extract_text_from_pdf(pdf_path)
+              if text is None:
+                     return "Error: Could not extract text from the PDF."
+              # Remove extra whitespaces and split into words
+              words = re.findall(r'\b\w+\b', text.lower())
+              return len(words)
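
The reuse suggested here can be sketched without PyPDF2 by passing the extractor in as an argument. Several things below are assumptions that do not come from the PR: the helper name `count_words_from`, the `fake_extractor` stand-in, and the choice to return `None` rather than an error string when extraction fails. As in the suggestion, the file's `extract_text_from_pdf` is assumed to return `None` on failure.

```python
import re


def count_words_from(extract_text, pdf_path):
    """Count words in the text returned by extract_text(pdf_path).

    extract_text stands in for the file's extract_text_from_pdf; it is
    assumed to return a string, or None when extraction fails.
    """
    text = extract_text(pdf_path)
    if text is None:
        return None  # signal failure without mixing strings and ints
    return len(re.findall(r"\b\w+\b", text.lower()))


def fake_extractor(pdf_path):
    # Hypothetical extractor, used only to exercise the sketch without a PDF.
    return "One page of sample text." if pdf_path.endswith(".pdf") else None


if __name__ == "__main__":
    print(count_words_from(fake_extractor, "sample.pdf"))
```

In the real script the call would presumably be `count_words_from(extract_text_from_pdf, pdf)`, which keeps the page-reading loop in one place.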

+       except FileNotFoundError:
+              return "Error: PDF file not found."
+       except Exception as e:
+              return f"An error occurred: {e}"
Comment on lines +70 to +84
Copilot AI Jul 21, 2025

Inconsistent indentation: the function uses 7 spaces instead of the standard 4 spaces used elsewhere in the file.

Suggested change
-       try:
-              with open(pdf_path, 'rb') as pdf_file:
-                     pdf_reader = PyPDF2.PdfReader(pdf_file)
-                     text = ""
-                     for page_num in range(len(pdf_reader.pages)):
-                            page = pdf_reader.pages[page_num]
-                            text += page.extract_text()
-                     # Remove extra whitespaces and split into words
-                     words = re.findall(r'\b\w+\b', text.lower())
-                     return len(words)
-       except FileNotFoundError:
-              return "Error: PDF file not found."
-       except Exception as e:
-              return f"An error occurred: {e}"
+    try:
+        with open(pdf_path, 'rb') as pdf_file:
+            pdf_reader = PyPDF2.PdfReader(pdf_file)
+            text = ""
+            for page_num in range(len(pdf_reader.pages)):
+                page = pdf_reader.pages[page_num]
+                text += page.extract_text()
+            # Remove extra whitespaces and split into words
+            words = re.findall(r'\b\w+\b', text.lower())
+            return len(words)
+    except FileNotFoundError:
+        return "Error: PDF file not found."
+    except Exception as e:
+        return f"An error occurred: {e}"

Comment on lines +82 to +84
Copilot AI Jul 21, 2025

The function returns different types (integer for success, string for errors). This inconsistent return type makes error handling difficult. Consider raising exceptions or returning a consistent data structure.

Suggested change
-              return "Error: PDF file not found."
-       except Exception as e:
-              return f"An error occurred: {e}"
+              raise FileNotFoundError(f"PDF file not found: {pdf_path}")
+       except Exception as e:
+              raise RuntimeError(f"An error occurred while processing the PDF: {e}")
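
A rough sketch of the "raise instead of return a string" approach, so that callers always get an `int` or an exception. To stay dependency-free it reads a plain text file rather than a PDF; the function name `count_words_in_file` and the use of `raise ... from` to keep the original error attached are assumptions added for illustration, not part of the suggestion above.

```python
import re


def count_words_in_file(path):
    """Return the word count of a text file as an int, or raise on failure."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            text = handle.read()
    except FileNotFoundError:
        # Re-raise with a clearer message; `from None` drops the duplicate traceback.
        raise FileNotFoundError(f"File not found: {path}") from None
    except OSError as exc:
        # Chain the original error so the root cause stays visible to the caller.
        raise RuntimeError(f"Could not read {path}: {exc}") from exc
    return len(re.findall(r"\b\w+\b", text.lower()))
```

A caller can then wrap the call in `try`/`except` and only format the count when one was actually produced, instead of printing an error string as if it were a number.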



# Example usage:

#example pdf from internet
@@ -75,3 +93,5 @@ def convert_pdf_to_txt(pdf_path, save_to_file=True, output_folder="output_texts")

 # Convert PDF to text and save the cleaned text to a file
 convert_pdf_to_txt(pdf)
+word_count = count_words_in_pdf(pdf)
+print(f"Total word count in the PDF: {word_count}")