
Question: How to have a built StringGrouper corpus persist across multiple match_strings calls in a programming session #69

@justasojourner

Description


Hi, this is the logical next step in the clean-up of data extracted from a database with ID and name fields. It follows on from the use case documented in the first String Grouper tutorial, which I wrote.

Background

To recap the requirement back then: there was a reasonably large database of accounts with an ID and customer name. Browsing the database, it was easy to see by eye that there were many, many duplicates. Using String Grouper (with the added functionality to include IDs in the matching process) it is possible to get, insanely quickly, a list of IDs of duplicate records, which are then imported back into the database to join with the original table. It's all in the tutorial.

The people responsible are now starting the clean-up needed to remove all the duplicates, but I also want to set up a solution that proactively avoids duplicates when future bulk imports (typically from Excel) are done into this system. The web app does have functionality to identify duplicates, but it is very limited, and that limitation is what got the database into having so many duplicates in the first place.

Process/Requirement

I will have a Python application that opens the Excel file, walks through the rows, and does a number of checks and validations (missing data, etc.), rejecting rows that are not clean. One of the checks will be for potential duplicates, i.e. the customer name already exists in the database. I have access to the database, which is PostgreSQL.
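As a sketch of that validation pass (the column names and data here are hypothetical stand-ins for whatever the real Excel template uses), the missing-data check could look like:

```python
import pandas as pd

# Hypothetical import batch standing in for the rows read from the
# Excel file (in the real application this would come from pd.read_excel).
batch = pd.DataFrame(
    [
        ('Acme Pty Ltd', 'NSW'),
        (None, 'VIC'),            # missing customer name -> reject
        ('Beta Holdings', None),  # missing state -> reject
    ],
    columns=('customer_name', 'state'),
)

# Keep only rows with all required fields present; everything else is
# rejected before the duplicate check runs.
clean = batch.dropna(subset=['customer_name', 'state'])
rejected = batch.loc[batch.index.difference(clean.index)]

print(f'{len(clean)} clean row(s), {len(rejected)} rejected')
```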

PostgreSQL Search Options

PostgreSQL has a number of fuzzy-search options such as levenshtein, soundex and trigrams (pg_trgm), as well as tsvector full-text search. I have tried the first three, but one way or another they don't give me the duplicate-detection quality I have seen with String Grouper using cosine similarities.
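To illustrate one reason edit distance falls short here (a toy pure-Python Levenshtein, not PostgreSQL's fuzzystrmatch implementation): small wording or ordering changes in a company name produce a large edit distance, while a cosine measure over character n-grams still scores such pairs as near-duplicates.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[-1] + 1,              # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

# A reordered company name: clearly the same customer, but the edit
# distance is large relative to the string length, so a levenshtein
# threshold loose enough to catch this pair would also flood the
# results with false positives.
print(levenshtein('Some Company Ltd', 'Company Some Ltd'))
```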

Possibility of Using String Grouper?

So I did a simple test in String Grouper: importing the existing data (id & name) into a pandas DataFrame accounts, then making a one-row DataFrame duplicates to hold the string being queried.

import pandas as pd
from string_grouper import match_strings

accounts = pd.read_csv('data/accounts.tsv', delimiter='\t', lineterminator='\n')

duplicates = pd.DataFrame(
    [
        ('1111', 'Some Company Ltd')
    ],
    columns=('duplicate_id', 'duplicate_name')
)

matches = match_strings(
    accounts['name'],
    duplicates['duplicate_name'],
    master_id=accounts['id'],
    duplicates_id=duplicates['duplicate_id'],
    ignore_index=True,
    max_n_matches=20,
    min_similarity=0.8
)

It works. However, each call to the String Grouper match_strings function takes some time, because (I believe) it rebuilds the corpus on every call, and I have to call the function for every row I step through in the tabular data being imported. Obviously this is not efficient.

Question

So, to the question: is it possible to run String Grouper so that the corpus is built only once for the duration of the running Python application? That is:

  1. Initialise the Python program.
  2. Load the DataFrame accounts with the many rows.
  3. Build the corpus (once only).
  4. Loop:
    1. Step through the rows, getting the candidate (potential duplicate) name string.
    2. Run match_strings for each row.
  5. End loop.
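For what it's worth, string_grouper builds its corpus with scikit-learn's TfidfVectorizer over character n-grams, so until the package exposes a way to reuse a fitted corpus, the build-once/query-many pattern in the steps above can be approximated by dropping down to scikit-learn directly. Everything below (the find_duplicates helper, the char_wb 3-gram settings, the sample data) is an assumption and approximation of string_grouper's defaults, not its API:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical accounts table (in practice, loaded from the database).
accounts = pd.DataFrame(
    [('0001', 'Some Company Ltd'), ('0002', 'Another Business Inc')],
    columns=('id', 'name'),
)

# Step 3: build the corpus once -- fit the TF-IDF vectorizer on all
# account names and keep the resulting matrix for the whole session.
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3))
master_matrix = vectorizer.fit_transform(accounts['name'])

def find_duplicates(candidate: str, min_similarity: float = 0.8) -> pd.DataFrame:
    """Step 4: match one candidate name against the pre-built corpus.

    Only a transform (no re-fit) happens per call, so each lookup is cheap.
    """
    query_vector = vectorizer.transform([candidate])
    scores = cosine_similarity(query_vector, master_matrix).ravel()
    hits = scores >= min_similarity
    return pd.DataFrame({
        'id': accounts.loc[hits, 'id'].to_numpy(),
        'name': accounts.loc[hits, 'name'].to_numpy(),
        'similarity': scores[hits],
    })

# The existing 'Some Company Ltd' account should come back as the match.
print(find_duplicates('Some Company Ltd.'))
```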

Thanks in advance for any help/guidance.

p.s. This might be good functionality to build into the package; it is likely a fairly common use case.
