Company entity matching (and deduplication)

Date: 2016

Scoped the product, contributed outside-in knowledge through partnerships and OEM vendors, and developed the monetization strategy for entity matching / record linkage as a service. Migrated from rule-based record linkage to ML-based models (random forest and logistic regression) built as Hadoop and Spark applications, with a 20% increase in F1 and a 40% increase in recall while maintaining 90% precision. Core features include:

Tokenization: significant prediction, segmentation, run-together parsing, ordering, stop words
Transformation: entity name, address, phone, and URL canonicalization
Business Taxonomies: Abbreviations, acronyms, alternate names, word expansions, misspellings, internationalization
Feature blocking: tokenization, fuzzy matching (ratio, token set ratio, token partial ratio)
Context: Family Tree (subsidiaries, corporate linkages)
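
A minimal sketch of how the name canonicalization and token set ratio features listed above might fit together. This is not the production implementation: the abbreviation table, stop-word list, and function names are hypothetical, and the similarity score uses Python's standard-library difflib as a stand-in for whichever fuzzy-matching library was actually used.

```python
import re
from difflib import SequenceMatcher
from typing import List

# Hypothetical abbreviation table; a real business taxonomy would be far larger.
EXPANSIONS = {
    "intl": "international",
    "corp": "corporation",
    "inc": "incorporated",
    "co": "company",
}
# Hypothetical stop-word list for company names.
STOP_WORDS = {"the", "of", "and"}


def canonicalize(name: str) -> List[str]:
    """Lowercase, strip punctuation, expand abbreviations, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    tokens = [EXPANSIONS.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]


def ratio(a: str, b: str) -> float:
    """Plain character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def token_set_ratio(a: str, b: str) -> float:
    """Order-insensitive similarity over canonical token sets.

    Compares the shared tokens against each side's full token string
    and keeps the best of the three pairwise scores.
    """
    ta, tb = set(canonicalize(a)), set(canonicalize(b))
    common = " ".join(sorted(ta & tb))
    full_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    full_b = (common + " " + " ".join(sorted(tb - ta))).strip()
    return max(ratio(common, full_a), ratio(common, full_b), ratio(full_a, full_b))


if __name__ == "__main__":
    print(canonicalize("The Acme Co."))
    print(token_set_ratio("Acme Intl. Corp", "ACME International Corporation"))
    print(token_set_ratio("Acme Corp", "Zenith Holdings"))
```

In a pipeline like the one described, scores of this kind would be computed per candidate pair and fed to the classifier as features rather than used as a match decision on their own.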

Increased content extraction and yield by ~100%; enabled data cleansing applications (workflow and API) with monetization of $1.1M in ACV in the first 18 months (2016).
