Description
Hi. This is the logical next step in the clean-up process for data extracted from a database with an ID and name field. It follows on from the use case documented in the first String Grouper tutorial, which I wrote.
Background
To recap the requirement back then: there was a reasonably large database of accounts with an ID and customer name. Browsing the database, it was easy to see by eye that there were many, many duplicates. Using String Grouper (with the added functionality to include IDs in the matching process), it is possible to get, insanely quickly, a list of IDs of duplicate records, which are then imported back into the database to join against the original table. It's all in the tutorial.
The people responsible are now starting the clean-up needed to remove all the duplicates, but I also want to set up a solution to ensure that future bulk imports into this system (typically done from Excel) proactively avoid duplicates. The web app does have functionality to identify duplicates, but it is very limited, and that limitation is what got the database into having so many duplicates in the first place.
Process/Requirement
I will have a Python application which opens the Excel file, walks through the rows, performs a number of checks and validations (missing data, etc.), and rejects rows that are not clean. One of the checks will be for potential duplicates, i.e. the customer name already exists in the database. I have access to the database in PostgreSQL.
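To give an idea, the validation pass would look something like this (the file name and column names are just for illustration):

```python
import pandas as pd

# Read the bulk-import spreadsheet (requires openpyxl for .xlsx files).
rows = pd.read_excel('import.xlsx')

clean, rejected = [], []
for idx, row in rows.iterrows():
    # Reject rows with missing mandatory fields.
    if pd.isna(row.get('id')) or pd.isna(row.get('name')):
        rejected.append((idx, 'missing data'))
        continue
    # ... further checks go here, including the duplicate check below ...
    clean.append(row)
```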
PostgreSQL Search Options
PostgreSQL has a number of fuzzy-search options, such as `levenshtein`, `soundex`, and trigrams (`pg_trgm`), as well as full-text search with `tsvector`. I have tried the first three, but one way or another they don't give me the same de-duplication capability I have seen with String Grouper using cosine similarity.
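For reference, my trigram test looked roughly like this (run from Python via psycopg2; the connection string is illustrative, and `pg_trgm` must first be enabled on the database with `CREATE EXTENSION pg_trgm`):

```python
import psycopg2

conn = psycopg2.connect('dbname=accounts_db')  # illustrative connection string
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT id, name, similarity(name, %s) AS sim
        FROM accounts
        WHERE similarity(name, %s) > 0.4
        ORDER BY sim DESC
        LIMIT 20;
        """,
        ('Some Company Ltd', 'Some Company Ltd'),
    )
    for record in cur.fetchall():
        print(record)
```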
Possibility of Using String Grouper?
So I did a simple test with String Grouper, importing the existing data (ID and name) into a pandas DataFrame `accounts` and then making a one-row DataFrame `duplicates` to hold the string being queried:
```python
import pandas as pd
from string_grouper import match_strings

# The full account list exported from the database (ID and name).
accounts = pd.read_csv('data/accounts.tsv', delimiter='\t', lineterminator='\n')

# A one-row DataFrame holding the string being checked for duplicates.
duplicates = pd.DataFrame(
    [
        ('1111', 'Some Company Ltd')
    ],
    columns=('duplicate_id', 'duplicate_name')
)

matches = match_strings(
    accounts['name'],
    duplicates['duplicate_name'],
    master_id=accounts['id'],
    duplicates_id=duplicates['duplicate_id'],
    ignore_index=True,
    max_n_matches=20,
    min_similarity=0.8
)
```
It works. However, each time I call the String Grouper `match_strings` function it takes some time, because (I believe) it rebuilds the corpus on every call, and I have to call the function for every row as I step through the tabular data being imported. Obviously this is not efficient.
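One workaround I can see is to batch all the candidate names from the import file into a single `duplicates` DataFrame and call `match_strings` once per file rather than once per row. A sketch (file and column names illustrative):

```python
import pandas as pd
from string_grouper import match_strings

accounts = pd.read_csv('data/accounts.tsv', delimiter='\t', lineterminator='\n')

# All candidate rows from the import in one go, so the corpus is built
# once per file instead of once per row.
incoming = pd.read_excel('import.xlsx')
duplicates = pd.DataFrame({
    'duplicate_id': incoming['id'].astype(str),  # IDs as strings, like the master
    'duplicate_name': incoming['name'],
})

matches = match_strings(
    accounts['name'],
    duplicates['duplicate_name'],
    master_id=accounts['id'],
    duplicates_id=duplicates['duplicate_id'],
    ignore_index=True,
    max_n_matches=20,
    min_similarity=0.8,
)
```

That helps per file, but it still rebuilds the corpus on every import, which brings me to the actual question.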
Question
So, to the question: is it possible to run String Grouper so that the corpus is built only once, for the lifetime of the running Python application? That is:
- Initialise the Python program
- Load the DataFrame `accounts` with the many rows
- Build the corpus (once only)
- Loop:
  - Step through the rows, getting the candidate (potential duplicate) string `name`
  - Run `match_strings` for each row (a hand-rolled sketch of what I mean follows below)
- End loop
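In case it helps frame what I'm after, the same build-once pattern can be hand-rolled with scikit-learn (which string_grouper uses under the hood). To be clear, this is not string_grouper's API, and its scores won't exactly match string_grouper's (which applies its own string preprocessing by default); it is just the same TF-IDF-over-character-trigrams plus cosine-similarity idea, with the expensive fit done once:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

accounts = pd.read_csv('data/accounts.tsv', delimiter='\t', lineterminator='\n')

# Build the corpus once: TF-IDF over character trigrams (string_grouper's
# default n-gram size is also 3).
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 3))
master_matrix = vectorizer.fit_transform(accounts['name'])  # the expensive step

def find_duplicates(candidate, min_similarity=0.8):
    """Return the rows of `accounts` whose name is similar to `candidate`."""
    query_vec = vectorizer.transform([candidate])  # cheap, runs per row
    # TfidfVectorizer l2-normalises its rows by default, so this dot
    # product is the cosine similarity.
    sims = master_matrix.dot(query_vec.T).toarray().ravel()
    hits = sims >= min_similarity
    return accounts.loc[hits].assign(similarity=sims[hits])

# Inside the import loop, only the cheap transform runs per candidate:
print(find_duplicates('Some Company Ltd'))
```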
Thanks in advance for any help/guidance.
P.S. This might be good functionality to build into the package; it is likely a fairly common use case.