
Grounding Data Analysis with Domain‐Specific Ontologies


Knowledge Graph-Driven Data Analysis: Structure and Application

Objective: To significantly enhance the accuracy, reliability, and domain-relevance of AI-driven data analysis by grounding Large Language Models (LLMs) with formally defined domain knowledge.

Target Audience: Researchers, data scientists, developers, and domain experts looking to build more robust and context-aware AI data analysis solutions.


1. The Challenge: LLMs and Specialized Domains

Large Language Models (LLMs) are incredibly powerful but have inherent limitations when applied to specialized analytical tasks. This is particularly true in domains with nuanced methodologies, complex data relationships, or terminology not well-represented in their general training data. Even with techniques like Retrieval Augmented Generation (RAG), the LLM might retrieve relevant documents but still struggle to interpret and apply that information correctly without a more structured understanding of the domain's core concepts and rules. This can lead to:

  • Hallucinations: Generating plausible-sounding but incorrect or irrelevant analytical steps, visualizations, or interpretations.
  • Methodological Errors: Applying generic approaches where domain-specific calculations, data transformations, or statistical considerations are paramount.
  • Misinterpretation of Data: Failing to understand the specific meaning, units, inter-column relationships, or the implications of missing values, especially when dealing with multiple, interconnected datasets.

The sports and performance science domain is a prime example. It involves data from diverse sources (wearables, lab tests, subjective feedback), often with unique metrics and complex interdependencies (e.g., how training load impacts recovery markers like sleep quality or resting heart rate). An LLM, without explicit guidance, might struggle to understand and navigate this complexity effectively.

2. The Solution: BambooAI's Dataframe Ontology

BambooAI introduces a powerful mechanism to address these challenges: custom dataframe ontologies. An ontology is a formal, explicit specification of a shared conceptualization. In simpler terms for this context, it's a way to define the structure, concepts, relationships, and rules within your specific data domain. Think of it as the blueprint or schema that defines how a knowledge graph for your domain would be structured. By providing BambooAI with an ontology (in .ttl RDF/OWL format), you equip it with this structured "cheat sheet" specific to your analytical domain. The system then uses this ontology to build a task-specific, graph-like understanding to guide its analysis.
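
Although BambooAI consumes the .ttl file directly, it can be useful to sanity-check an ontology before handing it over. The short sketch below is an inspection aid rather than part of BambooAI's API: it assumes the rdflib package is installed, is run from the repository root, and uses the sample ontology path described in the next subsection to parse the file and list the classes and named individuals it declares.

import rdflib
from rdflib.namespace import OWL, RDF

# Parse the sample ontology shipped with the repository.
g = rdflib.Graph()
g.parse("web_app/ontologies/Sports_Data_Ontology.ttl", format="turtle")

# List the domain classes and named individuals the ontology defines.
classes = sorted(g.subjects(RDF.type, OWL.Class))
individuals = sorted(g.subjects(RDF.type, OWL.NamedIndividual))

print(f"{len(classes)} classes, {len(individuals)} named individuals")
for cls in classes:
    print("class:", g.namespace_manager.normalizeUri(cls))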

2.1. Ontology Structure and Content (Sports & Performance Example)

To make this concept more concrete, BambooAI includes a sample ontology specifically designed for the sports and performance science domain. This file, named Sports_Data_Ontology.ttl, can be found in the web_app/ontologies/ directory of the repository. This example, which we will refer to throughout this section, serves as a practical template and illustrates how to define:

  • Classes (owl:Class): Define the main concepts or categories of things in your domain.

    • Data containers: :ActivityDataframe, :WellnessDataframe
    • Data records/entities: :Activity, :Segment, :ActivityMeasurement, :WellnessMeasurement
    • Measurement types: :DirectlyMeasured, :Derived, :PreComputed (describing how a value is obtained)
    • Measurement categories: :Metabolic, :Mechanical, :Geospatial (grouping related metrics)
    • Helper objects: :Function, :Key (for defining computations and identifiers)

    From Sports_Data_Ontology.ttl:

    # Defines 'Activity' as a class, which is a specialized type of 'Timeseries'
    :Activity rdf:type owl:Class ;
              rdfs:subClassOf :Timeseries .
    
    # Defines 'Heartrate' as a specific, named concept (an individual)
    # It belongs to several classes, describing its nature
    :Heartrate rdf:type owl:NamedIndividual ,       # It's a specific concept
                       :ActivityMeasurement ,     # It's a type of measurement in an activity
                       :DirectlyMeasured ,        # It's directly measured
                       :Metabolic ;               # It falls under the metabolic category
               :isPresentInDataset "true"^^xsd:boolean ; # Indicates it's expected in the dataset
               :measuredInUnits "BPM" .                  # Specifies its unit of measurement
  • Object Properties (owl:ObjectProperty): Define relationships between classes or individuals.

    • :containsActivity (links :ActivityDataframe to :Activity)
    • :canBeComputedUsingFunction (links a metric like :HRDrift to its calculation function :calculateHRDrift)
    • :functionRequiresMeasurements (links a :Function to the :MeasurementCategory or specific measurements it needs as input)

    From Sports_Data_Ontology.ttl:

    # Defines a relationship 'containsLap'
    :containsLap rdf:type owl:ObjectProperty ;
                 rdfs:domain :Activity ; # An Activity can contain laps
                 rdfs:range :Segment .   # A Lap is a type of Segment
  • Data Properties (owl:DatatypeProperty): Define attributes of classes/individuals that have literal values (like strings, numbers, booleans).

    • :measuredInUnits (e.g., "BPM", "Watts", "Meters/Second", "Degrees")
    • :functionDefinition (contains the actual Python code for a function)
    • :isPresentInDataset (boolean flag indicating if a field is expected in the data)
    • :allowedValues (e.g., for a categorical field like :ActivityType, specifying "Run", "Ride", "Swim")

    From Sports_Data_Ontology.ttl:

    # Defines 'functionDefinition' as a property that holds a string (the code)
    :functionDefinition rdf:type owl:DatatypeProperty ;
                        rdfs:domain :Function ; # Functions have definitions
                        rdfs:range xsd:string . # The definition is a string
  • Individuals (owl:NamedIndividual): Represent specific instances or concrete concepts within your domain.

    • Specific metrics: :Heartrate, :Power, :Pace, :Cadence, :Gradient
    • Specific functions: :calculateHRDrift, :calculateFunctionaThresholdPower, :computeMeanMaxCurve, :calculateCriticalPower
    • Specific dataframes (as conceptual types): :ActivityDataframe (representing the primary dataset type for activities)

    From Sports_Data_Ontology.ttl:

    :calculateHRDrift rdf:type owl:NamedIndividual , :Function ; # It's a specific function
        :applicableToDataObject :Activity ; # It applies to Activity data
        :functionRequiresMeasurements :Datetime, :Heartrate, :Power ; # Its inputs
        :functionDefinition """
    # Calculate HR drift using power data for exertion normalization.
    Imports:
        import pandas as pd
        import numpy as np
        from scipy import stats
    
    Parameters:
        df (pandas.DataFrame): DataFrame containing at least 'Datetime', 'Heartrate', and 'Power' columns.
    
    Returns:
        float: HR drift per hour (normalized units per hour)
    
    Function Body:    
        # Ensure the DataFrame is sorted by Datetime
        df['Datetime'] = pd.to_datetime(df['Datetime'])
        # ... (rest of the Python code) ...
        return drift_per_hour
    """ ; # The actual Python code for the function
        rdfs:comment "This function is used to measure cardiovascular drift..." .

The key capability here is the ability to define custom functions such as :calculateHRDrift or :calculatePaceFunction directly within the ontology (as seen in Sports_Data_Ontology.ttl), including their Python code, required inputs, and descriptive comments. This ensures the LLM uses validated, domain-correct calculations rather than attempting to invent them.
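
Because these definitions live in plain RDF, they can also be pulled out programmatically. The sketch below is illustrative rather than BambooAI's internal extraction path: it assumes the rdflib package and matches predicates by their local name ("functionDefinition") so it does not depend on the ontology's base IRI.

import rdflib

g = rdflib.Graph()
g.parse("web_app/ontologies/Sports_Data_Ontology.ttl", format="turtle")

# Map each function individual to the verbatim code block stored against it.
function_code = {
    str(subject): str(literal)
    for subject, predicate, literal in g
    if str(predicate).endswith("functionDefinition")
}

for name, code in sorted(function_code.items()):
    print(name)
    print("\n".join(code.splitlines()[:5]))  # first few lines of the stored definition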

3. Extracting Domain Knowledge: The Dataframe Inspector Agent

When a user poses an analytical question, and an ontology is provided, BambooAI's Dataframe Inspector agent takes center stage. Its role is to:

  1. Receive: The user's task, the full ontology, a preview of the primary dataframe, and a list of available auxiliary datasets.
  2. Analyze: It uses an LLM, guided by a sophisticated internal prompt (see dataframe_inspector_user in the project's context), to parse these inputs. This prompt is crucial for directing the LLM's attention and reasoning.
  3. Extract: It selectively extracts only the relevant pieces of information from the ontology and dataset previews that are pertinent to solving the user's specific task. This is vital for keeping the context provided to downstream agents concise and focused.
  4. Structure: This extracted knowledge is compiled into a structured YAML format (an illustrative sketch of this output appears at the end of this section).

To perform this extraction efficiently, especially in terms of cost and latency, BambooAI can be configured to use smaller, highly capable models for the Dataframe Inspector agent. For instance, models like Google's Gemini 2.0 Flash have proven particularly adept at this kind of structured information extraction, offering a good balance of performance and resource utilization. This allows the more powerful (and potentially more expensive) models to be reserved for complex planning or code generation tasks.

The extraction prompt is carefully designed to ensure the LLM focuses on:

  • Task Relevance: Only entities, attributes, relationships, and functions needed for the current query are included. If the user asks about cycling power, details about swimming stroke rates (if also in the ontology) would be omitted from the YAML.
  • Data Source Identification: Clearly linking ontology concepts to actual datasets (primary or auxiliary) using dataset_source_identifier and domain_label. This helps in scenarios with multiple data files.
  • Verbatim Function Extraction: Custom functions defined in the ontology are extracted exactly as written, preventing the LLM from inventing or altering domain-specific code.
  • Hierarchical Understanding: Preserving the structure of the data (e.g., an ActivityDataframe contains Activity records, which in turn contain ActivityMeasurement values).

Example Snippet of the Extraction Prompt's Logic (Illustrative): The prompt instructs the Dataframe Inspector to create a YAML structure including:

# ... (metadata, data_hierarchy) ...
functions:
  REQUIREMENTS:
  - ONLY extract functions defined in the ontology.
  - ONLY include functions needed for this task.
  - Extract VERBATIM from the ontology.
  - NO modifications or additions.
  - NO invented functions.
# ... (relationships) ...

This strict guidance ensures the integrity of domain-specific logic when it's passed to other agents.
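
To make the output of this step concrete, the snippet below sketches the kind of task-specific YAML the Dataframe Inspector might emit for an HR-drift question. The field names and layout are assumptions chosen for illustration; the actual schema is dictated by the dataframe_inspector_user prompt.

import yaml

# Illustrative only: the keys below are assumptions, not BambooAI's actual schema.
extracted_knowledge = {
    "metadata": {"task": "Calculate HR drift for the most recent ride"},
    "data_hierarchy": {
        "ActivityDataframe": {
            "dataset_source_identifier": "df",   # primary dataframe
            "domain_label": "Activity Data",
            "measurements": {
                "Heartrate": {"units": "BPM", "kind": "DirectlyMeasured"},
                "Power": {"units": "Watts", "kind": "DirectlyMeasured"},
                "Datetime": {"kind": "Key"},
            },
        }
    },
    "functions": {
        "calculateHRDrift": {
            "applicable_to": "Activity",
            "requires": ["Datetime", "Heartrate", "Power"],
            "definition": "<verbatim Python code copied from the ontology>",
        }
    },
    "relationships": [],
}

print(yaml.safe_dump(extracted_knowledge, sort_keys=False))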

4. Grounding the LLM: Applying the Extracted Knowledge

The YAML output from the Dataframe Inspector is not just an intermediate artifact; it becomes a critical part of the context for subsequent LLM agents in BambooAI's workflow, such as the Planner and Code Generator. Think of this YAML as a dynamic, task-specific schema and rulebook derived from the broader domain ontology.

By injecting this task-specific, ontology-derived YAML into their context window, BambooAI achieves several benefits:

  • Reduced Hallucinations: The LLM is "reminded" of the valid entities, metrics, relationships, and constraints relevant only to the task at hand. This drastically narrows the scope for making things up.
  • Methodological Adherence: If the task requires a custom calculation (e.g., "Calculate HR Drift"), the YAML will provide the exact Python function from the ontology, ensuring the Code Generator uses it. It won't try to find a generic HR drift formula from its training data.
  • Improved Data Comprehension: The LLM gains a clearer understanding of column names, units, expected values, and how different datasets (e.g., activity data and wellness data) might be linked (e.g., via a common Date key, as specified in the YAML's relationship section).
  • Contextual Code Generation: The Code Generator can produce more accurate and efficient Python code because it's working with a well-defined, relevant subset of the domain model. It knows which columns to use, how to join dataframes if needed, and what functions are available.

This process effectively "grounds" the LLM's reasoning and code generation capabilities within the confines and specifics of the user's domain, as defined by the ontology and refined by the task.
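
As a rough illustration of the injection step, the sketch below shows one way such YAML could be folded into a downstream agent's prompt. The wording, function name, and parameters are hypothetical and do not reflect BambooAI's actual prompt templates.

# Hypothetical helper: the prompt wording and parameter names are illustrative.
def build_code_generator_prompt(task: str, domain_model_yaml: str, df_preview: str) -> str:
    return (
        "You are generating Python for a data analysis task.\n"
        "Use ONLY the entities, units, keys, and functions described in the domain model below.\n"
        "If the domain model supplies a function definition, use it verbatim.\n\n"
        "## Domain model (extracted from the ontology)\n"
        f"{domain_model_yaml}\n\n"
        "## Dataframe preview\n"
        f"{df_preview}\n\n"
        "## Task\n"
        f"{task}\n"
    )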

5. Visualizing the Domain Model for the User

To provide transparency and help users understand how BambooAI is interpreting their data and task in light of the ontology, a visualization of the extracted YAML is generated and presented in the UI.

  • Mechanism: The Python script (generate_model_graph in the project's context) parses the YAML output from the Dataframe Inspector; a much-reduced sketch of this step follows this list.
  • Output: It generates a Mermaid diagram. Mermaid is a JavaScript-based diagramming and charting tool that renders Markdown-inspired text definitions to create and modify diagrams dynamically.
  • Content: This diagram visually represents:
    • The identified data hierarchies (e.g., ActivityDataframe, WellnessDataframe) often grouped into subgraphs by domain_label.
    • Key entities, measurements, and attributes relevant to the task, with some of their properties displayed.
    • Functions extracted from the ontology that are pertinent to the query.
    • Relationships between these components (e.g., "contains", "has key", "applies to", "computes").
    • Cross-dataset links if applicable (e.g., how activity data might merge with wellness data, often shown with a dashed line and merge key information).
  • Interactivity: Depending on the Mermaid rendering environment, these diagrams can offer some level of interactivity, such as hover tooltips for more details on nodes or relationships.
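
The sketch below condenses the idea behind this step: read the inspector's YAML and emit Mermaid flowchart text. It assumes the illustrative YAML layout sketched in Section 3, and the function name, node styles, and edge styles are illustrative rather than the actual generate_model_graph implementation.

import yaml

def yaml_to_mermaid(model_yaml: str) -> str:
    model = yaml.safe_load(model_yaml)
    lines = ["graph TD"]
    for frame, details in model.get("data_hierarchy", {}).items():
        # Group each dataframe and its measurements into a labelled subgraph.
        label = details.get("domain_label", frame)
        lines.append(f'    subgraph {frame}_group["{label}"]')
        lines.append(f"        {frame}[{frame}]")
        for measurement in details.get("measurements", {}):
            lines.append(f"        {frame} --> {measurement}")
        lines.append("    end")
    for fn, spec in model.get("functions", {}).items():
        # Dashed edge from each ontology function to the data object it applies to.
        lines.append(f"    {fn}(({fn})) -.-> {spec.get('applicable_to', 'Activity')}")
    return "\n".join(lines)

# Usage: pass the YAML produced by the Dataframe Inspector, then render the
# returned string with Mermaid in the UI.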

Example of what the Mermaid diagram might represent (conceptual):

[Image: conceptual Mermaid rendering of the extracted data model, as displayed in the UI]

This visualization serves multiple purposes:

  • Transparency: Users can see how their domain knowledge is being understood and utilized for their specific query.
  • Debugging: If the analysis is not as expected, the graph can help pinpoint misinterpretations of the data or reveal whether a crucial piece of the ontology was missed by the extractor.
  • Confidence: It builds user trust by showing a structured, logical representation of the problem space as understood by the AI.
  • Learning: Users can learn more about the structure of their data and the defined relationships by examining the graph.

6. Broader Applicability

While the sports and performance domain provides a compelling example due to its inherent complexity and non-standardized data, the BambooAI ontology feature is domain-agnostic. Any field that benefits from structured domain knowledge can leverage this capability:

  • Finance: Defining financial instruments (stocks, bonds, derivatives), trading strategies, regulatory compliance rules (e.g., MiFID II, Basel III), risk calculation formulas (e.g., VaR, Sharpe Ratio), and chart of accounts structures.
  • Healthcare & Life Sciences: Modeling patient data (EHRs), medical conditions (using SNOMED CT or ICD-10 codes), treatment protocols, drug interactions, genomic pathways, and epidemiological calculations.
  • Manufacturing & IoT: Describing production lines, machine sensors (and their data streams), quality control parameters, maintenance schedules, supply chain logistics, and defect analysis logic.
  • E-commerce & Marketing: Defining product catalogs (with complex attributes and variants), customer segments, recommendation algorithms, marketing campaign structures, and sales funnel metrics.
  • Scientific Research (General): Encoding experimental designs, types of observations, measurement units and error margins, statistical models, and domain-specific formulas for any scientific discipline (e.g., physics, chemistry, environmental science).
  • Legal Tech: Structuring case law, contractual clauses, legal precedents, and compliance frameworks.

The power lies in enabling users to inject their precise, expert knowledge into the LLM's analytical process. This transforms the LLM from a generalist into a domain-aware assistant, leading to more reliable, accurate, and insightful results. We encourage users to develop and share ontologies for their specific domains to further enhance the capabilities of BambooAI and foster a community of practice around domain-grounded AI.