This repository was archived by the owner on Mar 31, 2024. It is now read-only.

QingTian1927/SearTxT-and-Texter

SearTxT & Texter

  • 2023-09-02: SearTxT-and-Texter won't be receiving any updates for a long while.
  • 2024-03-31: SearTxT-and-Texter is now officially archived as I no longer code primarily in Python.

These are the improved versions of my original Python Text Searcher, which was unnecessarily bloated, to say the least.

  • SearTxT is a simple command-line tool to search for virtually any string of text contained within .txt files in a user-specified directory.

  • Texter is a complementary file converter that can convert .docx, .pdf as well as several other file formats into .txt for use with SearTxT.

I wrote these programs mainly to learn the basics of Python (and also for fun), so don't expect the same level of polish and utility that comes with tools such as fzf or grep. That said, I still hope you find SearTxT & Texter useful somehow.

With Special Thanks To

These wonderful people have provided invaluable help and support in the creation of SearTxT & Texter:

  • Master Harry Dreamer: testing & bug reporting
  • Master Eltidee: testing
  • OBP Corp: listening to my perpetual ramblings about the benefits of Free and Open Source Software

Table of Contents

  1. Features
  2. Installation
  3. Usage
  4. Quickstart
  5. Conversion
  6. Building From Source
  7. Known Issues

Features

  • Roughly 8 times faster than the original Python Text Searcher :0
  • Support for .docx, .pdf, .doc, and many more (See the Conversion section for more details)
  • Automate boring, repetitive tasks with AutoScript (Coming soon)
  • Much eye candy >.<
  • And many more (probably...)

Installation

Simply download the latest release, extract the contents of the .zip archive, and launch SearTxT or Texter with the appropriate executable.

Note: Some anti-virus programs may falsely flag the executable as a virus and then quarantine it. To avoid this, you can either add the SearTxT directory as an exception or disable the anti-virus software completely (not recommended).

Texter-specific Requirements:

Texter can only convert .docx files with the pandoc runtime installed, so make sure you download it using the /pd command before starting the conversion process.

Note: Should the /pd command fail for any reason, you can download pandoc directly from the official website and install it manually.

Running From Source

If you want to run SearTxT or Texter directly with the Python Interpreter, make sure that your system satisfies the following requirements:

  • Python >= 3.10.8
  • Pip packages (Texter): pypandoc, pdfminer.six

Linux

Simply install Python with your package manager of choice, e.g.:

APT

sudo apt install python-is-python3

Pacman

sudo pacman -S python

Windows

You can either download Python from the official website or install it with a package manager such as scoop:

scoop install python

Pip

Once you have installed Python and added the installation directory to your PATH, download the required packages with:

python -m pip install pypandoc pdfminer.six

You may also have to upgrade pip first:

python -m pip install --upgrade pip

Usage

Usage : /command <required parameters> [optional parameters]
 or   : <any string of characters> (SearTxT only)

Ex    : /ls
        /ls 4 -s or /ls 4 --script
        /t -h    or /t --half
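
As a rough illustration of the syntax above, the sketch below separates commands from search queries. It is not SearTxT's actual parser: parse_input is a hypothetical helper, and splitting parameters on whitespace is an assumption (a /cd path containing spaces would need the rest of the line kept whole).

```python
def parse_input(line: str):
    """Classify one line of user input (hypothetical helper)."""
    line = line.strip()
    if not line.startswith("/"):
        # Anything that does not start with "/" is treated as a search query.
        return ("query", line, [])
    # "/ls 4 -s" -> name "ls", rest "4 -s"
    name, _, rest = line[1:].partition(" ")
    if not name:
        # A lone "/" carries no command name.
        return ("error", "", [])
    # Assumption: parameters are separated by whitespace.
    return ("command", name, rest.split())


print(parse_input("/ls 4 --script"))
print(parse_input("any string of characters"))
```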

General Commands

Change the target directory:

/cd [path] 

(default: script directory)

Platform-specific path separator:

/cd /home/DBVG/Documents     # UNIX absolute path
    C:\Users\DBVG\Documents  # Windows absolute path
    
/cd ../../example            # UNIX relative path
    ..\..\example            # Windows relative path

Quickly change to the script directory:

/cd ~ or just /cd        # To quickly change to the script directory

/cd ~/example1/example2  # To quickly change to a folder inside the script directory

Traverse relative paths:

/cd example               # Enter the specified folder in the current directory

/cd ..                    # To go up a directory

/cd ../../..              # To go up a number of directories

/cd ../example1/example2  # To go up a number of directories and enter the specified directory
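
The path rules above could be sketched roughly as follows. This is only a minimal illustration under my own assumptions, not the code SearTxT uses: resolve_target, script_dir and current are hypothetical names.

```python
import os
from pathlib import Path


def resolve_target(arg: str, script_dir: Path, current: Path) -> Path:
    """Turn a /cd argument into a target directory (hypothetical helper)."""
    if arg in ("", "~"):
        # "/cd" and "/cd ~" both jump back to the script directory.
        return script_dir
    if arg[:2] in ("~/", "~\\"):
        # "~/example" is a folder inside the script directory.
        return Path(os.path.normpath(script_dir / arg[2:]))
    path = Path(arg)
    if path.is_absolute():
        return path
    # Relative paths (including "..") are resolved against the current target.
    return Path(os.path.normpath(current / path))


script = Path("/home/DBVG/SearTxT")
print(resolve_target("../example", script, Path("/home/DBVG/Documents")))
```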

List the contents of a directory:

/ls [column: num > 0] [dir: -s / --script ; -t / --target]

(default: target dir, 3 columns)
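
For the curious, a column listing like this can be produced in a few lines. The sketch below is an assumption about the layout (sorted names, padded to the longest one), and format_columns is a hypothetical helper rather than the real /ls implementation.

```python
import os


def format_columns(names, columns=3):
    """Lay out file names in a fixed number of columns (hypothetical helper)."""
    if columns < 1:
        raise ValueError("the number of columns must be greater than 0")
    names = sorted(names)
    # Pad every cell to the longest name plus two spaces of spacing.
    width = max((len(name) for name in names), default=0) + 2
    rows = [names[i:i + columns] for i in range(0, len(names), columns)]
    return "\n".join(
        "".join(name.ljust(width) for name in row).rstrip() for row in rows
    )


print(format_columns(os.listdir("."), columns=3))
```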

Configure the number of CPUs used for the multi-threaded processes:

/t [threads: num ; -a / --all ; -h / --half ; -q / --quarter]

(default: all threads)
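
One way to map these options to a worker count is sketched below. The function name thread_count and the clamping to the range 1..cpu_count are my assumptions for illustration; the real /t command may behave differently.

```python
import os


def thread_count(option: str) -> int:
    """Translate a /t parameter into a number of workers (hypothetical helper)."""
    total = os.cpu_count() or 1  # cpu_count() may return None
    if option in ("-a", "--all"):
        return total
    if option in ("-h", "--half"):
        return max(1, total // 2)
    if option in ("-q", "--quarter"):
        return max(1, total // 4)
    # Otherwise expect a plain number; int() raises ValueError on bad input.
    requested = int(option)
    # Assumption: clamp the request to at least 1 and at most all CPUs.
    return max(1, min(requested, total))


print(thread_count("--half"))
```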

Misc:

* /c             : clear the display

* /h             : show all available commands

* /q or <CTRL-C> : terminate the program

SearTxT Commands

Change the search method:

/mt [method: -e / --exact ; -p / --proximity]

(default: exact match)

Change the minimum confidence score for approximate matches:

/s [score: 0 < float < 1]

(default: 0.85)
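
To give a feel for what a confidence score means, here is a minimal sketch that uses difflib from the standard library. This is an assumption for illustration only: this README does not say how SearTxT computes its score, and is_close_match is a hypothetical helper.

```python
from difflib import SequenceMatcher


def is_close_match(query: str, candidate: str, min_score: float = 0.85):
    """Return (matched, score) for an approximate comparison (hypothetical helper)."""
    # ratio() is 1.0 for identical strings and 0.0 when nothing matches.
    score = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
    return score >= min_score, score


print(is_close_match("searching", "searchin"))
print(is_close_match("searching", "converter"))
```

A lower minimum score lets through looser matches, while a higher one behaves more and more like an exact search.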

Texter Commands

Start the conversion process:

/cv [verbosity: -v / --verbose ; -b / --brief]

(default: brief final output)

Download and install pandoc:

/pd

Cat:

/cat

Quickstart

General Guide

SearTxT and Texter are command-line tools, so you will need to type out the exact command of the desired operation.

Configure the target directory

First, start by configuring your target directory. This is where SearTxT and Texter will try to search and convert your files respectively.

Example (Windows file path):

*****DBVG SearTxT ver 1.0*****
Script directory: D:\Downloads\SearTxT 
Search method: exact match

[SearTxT ~example]$ /cd C:\Users\DBVG\Documents\PDF Stuff
<ENTER>
.......
[SearTxT C:\Users\DBVG\Documents\PDF Stuff]$ _

Example (Unix file path):

*****DBVG Texter ver 1.0*****
Script directory: /home/DBVG/SearTxT

[Texter ~example]$ /cd /home/DBVG/Documents/DOCX Stuff
<ENTER>
.......
[Texter /home/DBVG/Documents/DOCX Stuff]$ _

To check whether you have entered the correct path, use the /ls command to check the content of the target directory.

Note: The ~ symbol indicates that the current target directory is inside the script directory, hence a relative path.

Texter Quickstart Guide

Since SearTxT can only search for strings in .txt files, you will have to run Texter first to convert other file formats (e.g. .docx) into .txt.

Download and install pandoc

Simply launch Texter and run the /pd command. Alternatively, you can also download and install pandoc manually, but make sure you add the installation directory to your PATH.
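
If you installed pandoc manually and want to confirm that it is reachable through your PATH, a quick check along these lines should do. find_pandoc is a hypothetical helper and is not part of Texter.

```python
import shutil


def find_pandoc():
    """Return the full path of the pandoc executable on PATH, or None."""
    return shutil.which("pandoc")


location = find_pandoc()
print(location or "pandoc was not found on PATH")
```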

Start the conversion

If you have correctly set up and moved your files inside the target directory, simply start the conversion by using the /cv command.

Note: Make sure to BACK UP your files first, as Texter PERMANENTLY DELETES the original files after the conversion.
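
A simple way to make such a backup is to copy the whole target directory before running /cv. The sketch below is only a suggestion: backup_directory and the "_backup" suffix are hypothetical choices of mine, not something Texter provides.

```python
import shutil
from pathlib import Path


def backup_directory(target: Path) -> Path:
    """Copy the target directory next to itself as '<name>_backup'."""
    backup = target.with_name(target.name + "_backup")
    # copytree raises FileExistsError if the backup folder already exists,
    # so an older backup is never silently overwritten.
    shutil.copytree(target, backup)
    return backup
```

For example, backup_directory(Path("/home/DBVG/Documents/DOCX Stuff")) would create a copy named "DOCX Stuff_backup" in the same parent folder.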

Check the results

Simply navigate to the specified target directory and check the newly-converted .txt files with your favorite text editor.

SearTxT Quickstart Guide

This guide assumes that you have already converted your files into .txt files.

Set the search method

To search for exact matches of the original query:

/mt -e

or

/mt --exact

To search for approximate matches of the original query:

/mt -p

or

/mt --proximity

Start searching

Simply type in virtually any string of characters and then hit ENTER.

Note: The search query cannot start with the / character. If your search query starts with /, SearTxT will throw an error message.

Check the results

If SearTxT finds any matches, it will print out the results on the screen. Simply use your mouse to scroll through the result list.
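
To illustrate what an exact-match search over a directory involves, here is a minimal sketch. It is not SearTxT's actual code: search_file and search_directory are hypothetical names, and the case-sensitive, non-recursive behaviour and the thread pool are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def search_file(path: Path, query: str):
    """Return (file name, line number, line) for every line containing query."""
    hits = []
    with path.open(encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if query in line:  # assumption: case-sensitive substring match
                hits.append((path.name, number, line.rstrip("\n")))
    return hits


def search_directory(target: Path, query: str, workers: int = 4):
    """Search every .txt file directly inside target, one file per task."""
    files = sorted(target.glob("*.txt"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: search_file(p, query), files))
    # Flatten the per-file lists into one result list.
    return [hit for hits in results for hit in hits]
```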

Conversion

As of version 1.0, Texter officially supports .docx and .pdf files. However, conversion from .pdf to plain text, especially for files with a large number of non-Latin characters, can be rather unreliable as it can break the formatting of the original documents.
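
For reference, the two pip packages listed in the requirements can be used for this kind of conversion roughly as shown below. This is a sketch under my own assumptions and not Texter's actual code: convert_to_txt is a hypothetical helper, and unlike Texter it leaves the original file in place.

```python
from pathlib import Path


def convert_to_txt(source: Path) -> Path:
    """Write a .txt version of a .docx or .pdf file next to the original."""
    target = source.with_suffix(".txt")
    suffix = source.suffix.lower()
    if suffix == ".docx":
        import pypandoc  # requires the pandoc runtime to be installed

        pypandoc.convert_file(str(source), "plain", outputfile=str(target))
    elif suffix == ".pdf":
        from pdfminer.high_level import extract_text  # from pdfminer.six

        target.write_text(extract_text(str(source)), encoding="utf-8")
    else:
        raise ValueError(f"unsupported file type: {suffix}")
    return target
```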

Unofficially, Texter by default can also try to convert the following file formats:

-----------------------------------------
.css .sass .html .htm .js .jsm .mjs .json
.markdown .md .mkd .org
.v .asc .log .conf
.doc
.py .py3 .pyi .pyx .py3x .wsgi
.rs .vbs .lua .p .pas .kt .java
.c .C .cs .c++ .cc .cpp .cxx
.lisp .go .hs
-----------------------------------------

It accomplishes this by reading these files in plain text mode and then copying the contents to a separate .txt file (very ingenious, ikr). If you want additional file formats, simply add them to unsupported_types.conf.
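
The read-and-copy approach boils down to something like the sketch below. The names copy_as_txt and PLAIN_TEXT_TYPES are hypothetical, the extension set is just a small sample from the list above, and the sketch does not read unsupported_types.conf.

```python
from pathlib import Path

# A small sample of the extensions listed above (assumption: matched as written).
PLAIN_TEXT_TYPES = {".md", ".py", ".log", ".conf"}


def copy_as_txt(source: Path) -> Path:
    """Read a plain-text file and write its contents to a sibling .txt file."""
    if source.suffix not in PLAIN_TEXT_TYPES:
        raise ValueError(f"not a known plain-text type: {source.suffix}")
    # Undecodable bytes are replaced instead of aborting the copy.
    text = source.read_text(encoding="utf-8", errors="replace")
    target = source.with_suffix(".txt")
    target.write_text(text, encoding="utf-8")
    return target
```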

Building From Source

If you feel like compiling your own executables, you can theoretically do so with any compatible Python compiler. The official releases were compiled with Nuitka, and this section provides instructions for Nuitka and PyInstaller.

Note: I don't recommend building SearTxT or running the SearTxT binary on a Linux system, as it may leak memory and is very buggy in general there.

With Nuitka

Prerequisites

  • Nuitka >= 1.3.6
  • Python >= 3.10
  • Pip packages (Nuitka): ordered-set, zstandard
  • Pip packages (Texter): pypandoc, pdfminer.six

Windows:

  • MSVC v143 - VS2022 C++ x64/x86 build tools (Latest)
  • Windows 11 SDK

Note: Python must NOT be installed from the Windows app store.

(Arch) Linux:

  • gcc
  • patchelf
  • ccache

Please refer to the Nuitka User Manual for more information.

Instructions

Clone the repository

Simply download the latest Source code archive and extract the contents. Alternatively, if you have git installed, use the following command:

git clone https://github.com/QingTian1927/SearTxT-and-Texter

Install Nuitka

python -m pip install nuitka

Building SearTxT

Open the extracted Source code directory in the command line and run:

python -m nuitka --standalone --onefile --remove-output <source_file>

Example (Windows):

python -m nuitka --standalone --onefile --remove-output --windows-icon-from-ico=<icon_path> <source_file>

If you have correctly configured everything, Nuitka should produce an executable within the same directory (SearTxT.exe on Windows, SearTxT.bin on Linux).

Known Issues

Repeating arguments

When the same argument is given several values, commands that accept multiple arguments (e.g. /ls) will only use the last value in the series:

[SearTxT ~example]$ /ls 2 3 -s -a
file1.txt  file2.txt  file3.txt
file4.txt  file5.txt  file6.txt
file7.txt  ...        ...

Discovered by Master Harry Dreamer

Solution: just don't repeat the same arguments. I won't fix this issue in the foreseeable future because I simply don't think it is enough of a problem yet, though I may fix it eventually.
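
This "last one wins" behaviour is what you typically get from a simple parsing loop like the one sketched here. It is an illustration only: parse_ls_args is a hypothetical helper, and silently ignoring unknown flags such as -a is an assumption.

```python
def parse_ls_args(params):
    """Parse /ls parameters; a later value overwrites an earlier one."""
    columns, directory = 3, "target"  # documented defaults
    for param in params:
        if param.isdigit() and int(param) > 0:
            columns = int(param)  # "2" is overwritten by "3"
        elif param in ("-s", "--script"):
            directory = "script"
        elif param in ("-t", "--target"):
            directory = "target"
        # Assumption: anything else (e.g. "-a") is ignored.
    return columns, directory


print(parse_ls_args(["2", "3", "-s", "-a"]))  # prints (3, 'script')
```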

False positives

Some anti-virus providers may falsely flag the SearTxT & Texter executables as viruses and then quarantine them.

Solution: add an exception to the anti-virus program or disable it completely (not recommended).

About

A simple text searcher & its complementary DOCX/PDF-to-TXT converter written in Python
