Description
Hi. This is the logical next step in the clean-up process for data extracted from a database with an ID and name field. It follows on from the use case documented in the first String Grouper tutorial, which I wrote.
Background
To recap the requirement back then: there was a reasonably large database of accounts with an ID and customer name. Browsing the database, it was easy to see by eye that there were many, many duplicates. Using String Grouper (with the added functionality to include IDs in the matching process), it is possible to get, insanely quickly, a list of IDs of duplicate records, which are then imported back into the database to join against the original table. It's all in the tutorial.
The people responsible are now starting the clean-up needed to remove all the duplicates, but I also want to set up a solution to ensure that future bulk imports into this system (typically done from Excel) proactively avoid duplicates. The web app does have functionality to identify duplicates, but it is very limited, and that limitation is what got the database into having so many duplicates in the first place.
Process/Requirement
I will have a Python application which opens the Excel file, walks through the rows, performs a number of checks and validations (missing data, etc.), and rejects rows that are not clean. One of the checks will be for potential duplicates, i.e. the customer name already exists in the database. I have access to the database in PostgreSQL.
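To give an idea, the validation pass would look something like this (the file name and column names are just for illustration):

```python
import pandas as pd

# Read the bulk-import spreadsheet (requires openpyxl for .xlsx files).
rows = pd.read_excel('import.xlsx')

clean, rejected = [], []
for idx, row in rows.iterrows():
    # Reject rows with missing mandatory fields.
    if pd.isna(row.get('id')) or pd.isna(row.get('name')):
        rejected.append((idx, 'missing data'))
        continue
    # ... further checks go here, including the duplicate check below ...
    clean.append(row)
```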
PostgreSQL Search Options
PostgreSQL has a number of fuzzy-search options, such as `levenshtein`, `soundex`, and trigrams (`pg_trgm`), as well as full-text search with `tsvector`. I have tried the first three, but one way or another they don't give me the same de-duplication capability I have seen with String Grouper using cosine similarity.
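For reference, my trigram test looked roughly like this (run from Python via psycopg2; the connection string is illustrative, and `pg_trgm` must first be enabled on the database with `CREATE EXTENSION pg_trgm`):

```python
import psycopg2

conn = psycopg2.connect('dbname=accounts_db')  # illustrative connection string
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT id, name, similarity(name, %s) AS sim
        FROM accounts
        WHERE similarity(name, %s) > 0.4
        ORDER BY sim DESC
        LIMIT 20;
        """,
        ('Some Company Ltd', 'Some Company Ltd'),
    )
    for record in cur.fetchall():
        print(record)
```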
Possibility of Using String Grouper?
So I did a simple test with String Grouper, importing the existing data (ID and name) into a pandas DataFrame `accounts` and then making a one-row DataFrame `duplicates` to hold the string being queried:
```python
import pandas as pd
from string_grouper import match_strings

# The full account list exported from the database (ID and name).
accounts = pd.read_csv('data/accounts.tsv', delimiter='\t', lineterminator='\n')

# A one-row DataFrame holding the string being checked for duplicates.
duplicates = pd.DataFrame(
    [
        ('1111', 'Some Company Ltd')
    ],
    columns=('duplicate_id', 'duplicate_name')
)

matches = match_strings(
    accounts['name'],
    duplicates['duplicate_name'],
    master_id=accounts['id'],
    duplicates_id=duplicates['duplicate_id'],
    ignore_index=True,
    max_n_matches=20,
    min_similarity=0.8
)
```
It works. However, each time I call the String Grouper `match_strings` function it takes some time, because (I believe) it rebuilds the corpus on every call, and I have to call the function for every row as I step through the tabular data being imported. Obviously this is not efficient.
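One workaround I can see is to batch all the candidate names from the import file into a single `duplicates` DataFrame and call `match_strings` once per file rather than once per row. A sketch (file and column names illustrative):

```python
import pandas as pd
from string_grouper import match_strings

accounts = pd.read_csv('data/accounts.tsv', delimiter='\t', lineterminator='\n')

# All candidate rows from the import in one go, so the corpus is built
# once per file instead of once per row.
incoming = pd.read_excel('import.xlsx')
duplicates = pd.DataFrame({
    'duplicate_id': incoming['id'].astype(str),  # IDs as strings, like the master
    'duplicate_name': incoming['name'],
})

matches = match_strings(
    accounts['name'],
    duplicates['duplicate_name'],
    master_id=accounts['id'],
    duplicates_id=duplicates['duplicate_id'],
    ignore_index=True,
    max_n_matches=20,
    min_similarity=0.8,
)
```

That helps per file, but it still rebuilds the corpus on every import, which brings me to the actual question.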
Question
So, to the question: is it possible to run String Grouper so that the corpus is built only once, for the lifetime of the running Python application? That is:
- Initialise the Python program
- Load the DataFrame `accounts` with the many rows
- Build the corpus (once only)
- Loop:
  - Step through the rows, getting the candidate (potential duplicate) string `name`
  - Run `match_strings` for each row (a hand-rolled sketch of what I mean follows below)
- End loop
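In case it helps frame what I'm after, the same build-once pattern can be hand-rolled with scikit-learn (which string_grouper uses under the hood). To be clear, this is not string_grouper's API, and its scores won't exactly match string_grouper's (which applies its own string preprocessing by default); it is just the same TF-IDF-over-character-trigrams plus cosine-similarity idea, with the expensive fit done once:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

accounts = pd.read_csv('data/accounts.tsv', delimiter='\t', lineterminator='\n')

# Build the corpus once: TF-IDF over character trigrams (string_grouper's
# default n-gram size is also 3).
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 3))
master_matrix = vectorizer.fit_transform(accounts['name'])  # the expensive step

def find_duplicates(candidate, min_similarity=0.8):
    """Return the rows of `accounts` whose name is similar to `candidate`."""
    query_vec = vectorizer.transform([candidate])  # cheap, runs per row
    # TfidfVectorizer l2-normalises its rows by default, so this dot
    # product is the cosine similarity.
    sims = master_matrix.dot(query_vec.T).toarray().ravel()
    hits = sims >= min_similarity
    return accounts.loc[hits].assign(similarity=sims[hits])

# Inside the import loop, only the cheap transform runs per candidate:
print(find_duplicates('Some Company Ltd'))
```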
Thanks in advance for any help/guidance.
P.S. This might be good functionality to build into the package; it is likely a fairly common use case.