PyData Miami 2022

Enterprise Semantic Search with Python Large Language Models
09-22, 17:50–18:20 (US/Eastern), Main Room

Enterprise Search is a key use case in big data and business computing. In this talk we introduce Enterprise Semantic Search with Large Language Models (LLMs), and present a working demonstration in the financial domain. Semantic search matches queries and documents by representations of their meaning, rather than by the literal keywords they share. We use the recent HuggingFace transformers library, together with related Python libraries (TensorFlow, sklearn and UMAP) for NLP and deep learning. Approaches, data visualization, metrics and datasets for search system evaluation are introduced. The talk will be of interest to developers working on text search and new unstructured data applications. Slides and a demo notebook will be available at the time of PyData Miami 2022.


This talk presents Enterprise Semantic Search with Python Large Language Models (LLMs) and a working demonstration in the financial domain. We use the recent HuggingFace transformers library, together with related Python libraries (TensorFlow, PyTorch, sklearn and UMAP) for NLP and deep learning.

The talk is about search technology and working code. Attendees should walk away understanding what Enterprise Search, Semantic Search, vector space representations and Large Language Models are, and know the basics of working with HuggingFace transformers.

Talk outline with estimated times per section:
1. Introduction: enterprise search, semantic search, Large Language Models and HuggingFace Transformers; 5 min.
2. Traditional and neural information retrieval: sparse and dense vector space representations of queries and documents; 5 min.
3. Financial semantic search Python application: dataset, task and performance; 10 min.
4. t-SNE and UMAP data visualization in dense vector spaces; 2 min.
5. Model risk and ethics considerations (data and model cards); 3 min.
6. Questions; 5 min.
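
To make item 2 of the outline concrete, here is a minimal sketch contrasting a sparse (bag-of-words) representation with a dense one. It is not the talk's code. The documents, the query and the 3-dimensional embeddings are hypothetical values assigned by hand for illustration; in a real system the dense vectors would come from a transformer encoder. Only the Python standard library is used, so the sketch runs without downloading a model.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace and strip basic punctuation."""
    tokens = (tok.strip(".,;:!?").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def sparse_vector(text):
    """Sparse bag-of-words representation: term -> count."""
    return Counter(tokenize(text))


def sparse_cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def dense_cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank(scores):
    """Document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy corpus and query (not from the talk's dataset).
documents = [
    "The central bank raised interest rates",
    "Quarterly earnings beat analyst expectations",
    "The Fed tightened monetary policy",
]
query = "interest rates increased"

# Hypothetical hand-assigned embeddings, for illustration only.
# A real system would compute these with a transformer encoder.
doc_embeddings = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.8, 0.2, 0.1],
]
query_embedding = [0.85, 0.1, 0.05]

query_sparse = sparse_vector(query)
sparse_scores = [sparse_cosine(query_sparse, sparse_vector(d)) for d in documents]
dense_scores = [dense_cosine(query_embedding, e) for e in doc_embeddings]

print("sparse ranking:", rank(sparse_scores))
print("dense ranking: ", rank(dense_scores))
```

In this toy setup the third document shares no keyword with the query, so its sparse score is zero, while its assumed embedding lies close to the query embedding and it scores highly in the dense space. That gap between keyword overlap and similarity of meaning is what dense retrieval is meant to close.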

The talk will be of interest to developers working on text search and new unstructured data applications. Slides and a repository of the dataset and Python code created will be available at the time of PyData Miami 2022.

Python libraries referenced: NumPy, pandas, scikit-learn, umap, transformers, datasets, TensorFlow, Flask.

Background knowledge assumed: Python, vector spaces.
Useful but not assumed: information retrieval, linear algebra, machine learning and deep learning.


Prior Knowledge Expected

Previous knowledge expected

Nelson is Founder and CEO of Andinum, Inc., and has 30 years of experience in natural language processing, machine learning and software development. Prior to Andinum, Nelson was data architect and data scientist at Bank of America; postdoctoral researcher, visiting scientist and senior software engineer in natural language processing at IBM Research; Professor in the Department of Electrical Engineering at Universidad de los Andes; and Vice President of Engineering at two VC-funded startups in New York. Nelson holds a Ph.D. degree in Electrical Engineering and a Master's degree in Mathematics from Syracuse University.