MORE FEATURES NEEDED - Docker-compose deployment + Web crawling + Social media ingestion + Graphiti RAG integration #66

@Kakachia777

Description

Problem Statement

Currently, ii-agent requires significant manual setup and lacks persistent knowledge capabilities:

  • Complex deployment: Users must manually install dependencies, configure databases, and set up workspace directories
  • Limited data sources: Only web search is available; no crawling or social media ingestion
  • No knowledge persistence: The agent forgets everything between sessions; there is no RAG system
  • Poor developer experience: Multiple config files, manual API key management, scattered setup steps

Current user journey: Clone repo → Install Python deps → Configure workspace → Set up database → Configure API keys → Debug connection issues → Finally test agent

Desired user journey: Clone repo → docker compose up → Add API keys to .env → Agent ready with crawling + RAG

Proposed Solution

Add a comprehensive data ingestion and knowledge management system with containerized deployment.

Core Components

1. Docker Infrastructure

  • Full docker-compose stack (agent + UI + postgres + redis + neo4j + graphiti + crawler workers)
  • Environment-based configuration
  • Volume persistence for data
  • Health checks and dependency management
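
For the health-check and dependency wiring, something like the following would let dependent services wait until Postgres is genuinely ready (a minimal sketch; the pg_isready probe and intervals are placeholder values, not final ones):

services:
  postgres:
    image: postgres:15
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5

  ii-agent:
    depends_on:
      postgres:
        condition: service_healthy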

2. Web Crawling System with Crawl4AI

  • WebCrawlTool using Crawl4AI v0.5.0 for content extraction
  • Deep crawling capabilities with configurable strategies (BFS, DFS, Best-First)
  • Memory-Adaptive Dispatcher for large-scale crawls with dynamic concurrency
  • LLM-friendly markdown output, well suited for RAG pipeline ingestion
  • Redis-based job queue for async crawling
  • Automatic content deduplication and chunking
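
A minimal sketch of what the tool would wrap, using Crawl4AI's documented deep-crawl API (BFSDeepCrawlStrategy, FilterChain, and URLPatternFilter are Crawl4AI v0.5 names; the function and its defaults are assumptions for illustration):

# Deep-crawl a site breadth-first and return LLM-ready markdown per page.
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter

async def deep_crawl(url: str, max_depth: int = 3, max_pages: int = 50):
    strategy = BFSDeepCrawlStrategy(
        max_depth=max_depth,
        max_pages=max_pages,
        include_external=False,
        filter_chain=FilterChain([URLPatternFilter(patterns=["*/blog/*", "*/docs/*"])]),
    )
    config = CrawlerRunConfig(deep_crawl_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        # With a deep-crawl strategy, arun() returns one result per page visited.
        results = await crawler.arun(url, config=config)
    return [(r.url, str(r.markdown)) for r in results]

if __name__ == "__main__":
    pages = asyncio.run(deep_crawl("https://example.com"))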

3. Social Media Connectors

  • Reddit: PRAW-based fetcher for user comments/posts (Reddit listings cap at roughly 1000 items; see the sketch below)
  • Twitter/X: Tweepy v2 integration with rate limit handling
  • LinkedIn: REST API integration for public posts/analytics
  • TikTok: Research API support (graceful 403 handling for non-academic tokens)
  • Instagram: Graph API integration
  • Other platforms as follow-ups...
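
As a sketch of the Reddit connector (standard PRAW calls; the generator shape and field names are assumptions about reddit_fetcher.py, not its final interface):

# Stream recent submissions from one or more subreddits via PRAW.
import os
import praw

def fetch_reddit(subreddits: list[str], count: int = 500):
    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent="ii-agent social ingestion",
    )
    # Reddit listings cap out around 1000 items regardless of `limit`.
    for submission in reddit.subreddit("+".join(subreddits)).new(limit=count):
        yield {
            "id": submission.id,
            "title": submission.title,
            "text": submission.selftext,
            "created_utc": submission.created_utc,
            "url": f"https://reddit.com{submission.permalink}",
        }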

4. Graphiti Knowledge Graph Integration

  • Neo4j-backed knowledge graph with semantic search and relationship extraction
  • Episode-based memory: Stores conversation turns as episodes with automatic fact extraction
  • Advanced search capabilities:
    • Reciprocal Rank Fusion (RRF) combining BM25 and semantic search
    • Maximal Marginal Relevance (MMR) for diverse results
    • Cross-encoder reranking with OpenAI or BGE models
  • RAGQueryTool for agent to retrieve relevant context with graph relationships
  • Session-scoped knowledge with cross-session learning
  • Entity relationship mapping between people, topics, and content sources
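
For intuition, Reciprocal Rank Fusion scores each document by summing 1/(k + rank) over the BM25 and semantic rankings (k is conventionally 60). Graphiti performs this fusion internally, so the standalone version below is purely illustrative:

# RRF: documents ranked well by several retrievers float to the top.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: "a" leads both lists and wins; "b" beats "d" on combined rank.
fused = rrf([["a", "b", "c"], ["a", "d", "b"]])  # -> ['a', 'b', 'd', 'c']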

Technical Implementation

File Structure

├── docker-compose.yml
├── Dockerfile.crawler
├── .env.example
├── src/
│   ├── ii_agent/tools/
│   │   ├── web_crawl_tool.py
│   │   ├── social_ingestion_tool.py
│   │   └── rag_query_tool.py
│   ├── crawler/
│   │   ├── worker.py
│   │   ├── crawl4ai_config.py
│   │   └── processors/
│   ├── social/
│   │   ├── reddit_fetcher.py
│   │   ├── twitter_fetcher.py
│   │   └── linkedin_fetcher.py
│   └── graphiti/
│       ├── client.py
│       └── episode_manager.py

Key APIs

WebCrawlTool (using Crawl4AI deep crawling)

{
  "name": "web_crawl",
  "input": {
    "urls": ["https://example.com"],
    "strategy": "BFS",
    "max_depth": 3,
    "max_pages": 50,
    "url_filters": ["*/blog/*", "*/docs/*"],
    "crawler_mode": "browser"
  }
}
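
The tool handler itself can stay thin: validate the input, push a job onto the Redis queue, and let the crawler workers do the heavy lifting. A hedged sketch (the queue name and payload shape are assumptions):

# Enqueue a crawl job for the worker pool; returns a job id to poll on.
import json
import uuid
import redis.asyncio as redis

async def enqueue_crawl(tool_input: dict) -> str:
    r = redis.from_url("redis://redis:6379")
    job_id = str(uuid.uuid4())
    # Workers BRPOP from this list and forward the job to the crawl4ai service.
    await r.lpush("crawl_jobs", json.dumps({"job_id": job_id, **tool_input}))
    return job_id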

SocialIngestionTool

{
  "name": "social_ingest", 
  "input": {
    "platform": "reddit",
    "count": 500,
    "filters": {"subreddits": ["python", "MachineLearning"]}
  }
}

RAGQueryTool (with Graphiti graph search)

{
  "name": "rag_query",
  "input": {
    "query": "What topics do I comment about most?",
    "search_type": "hybrid",
    "rerank_method": "rrf",
    "include_relationships": true,
    "top_k": 10
  }
}
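
One plausible way to map rerank_method onto Graphiti: the recipe constants below exist in graphiti_core.search.search_config_recipes, while the dispatch table and the use of the lower-level _search() entry point are assumptions for illustration:

# Map the tool's rerank_method parameter to a Graphiti search recipe.
from graphiti_core.search.search_config_recipes import (
    COMBINED_HYBRID_SEARCH_CROSS_ENCODER,
    COMBINED_HYBRID_SEARCH_MMR,
    COMBINED_HYBRID_SEARCH_RRF,
)

RERANK_RECIPES = {
    "rrf": COMBINED_HYBRID_SEARCH_RRF,
    "mmr": COMBINED_HYBRID_SEARCH_MMR,
    "cross_encoder": COMBINED_HYBRID_SEARCH_CROSS_ENCODER,
}

async def rag_query(client, query: str, rerank_method: str = "rrf", top_k: int = 10):
    config = RERANK_RECIPES[rerank_method].model_copy(deep=True)
    config.limit = top_k
    # Graphiti's configurable search entry point takes (query, config).
    return await client._search(query, config)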

Graphiti Integration Details

Episode Management

# src/graphiti/episode_manager.py
from graphiti_core import Graphiti
from graphiti_core.nodes import EpisodeType
from graphiti_core.utils.bulk_utils import RawEpisode

class GraphitiEpisodeManager:
    def __init__(self, neo4j_uri, neo4j_user, neo4j_password):
        self.client = Graphiti(neo4j_uri, neo4j_user, neo4j_password)

    async def add_crawled_content(self, content, metadata):
        """Add one crawled page as an episode in the knowledge graph."""
        # graphiti-core takes episode fields as keyword arguments;
        # EpisodeType.text is the variant for raw document content.
        await self.client.add_episode(
            name=f"crawl_{metadata['url']}",
            episode_body=content,
            source=EpisodeType.text,
            source_description=metadata['url'],
            reference_time=metadata['timestamp'],
        )

    async def add_crawled_batch(self, pages):
        """Bulk-ingest (content, metadata) pairs via RawEpisode records."""
        episodes = [
            RawEpisode(
                name=f"crawl_{m['url']}",
                content=content,
                source=EpisodeType.text,
                source_description=m['url'],
                reference_time=m['timestamp'],
            )
            for content, m in pages
        ]
        await self.client.add_episode_bulk(episodes)

    async def search_knowledge(self, query, num_results=10):
        """Hybrid BM25 + semantic search over the graph."""
        # Graphiti's search() is already hybrid with RRF reranking; the MMR
        # and cross-encoder variants go through the recipe-based _search()
        # path shown earlier.
        return await self.client.search(query, num_results=num_results)
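
A hypothetical call site inside an async crawler worker (store_page and the metadata shape are illustrative, not part of the proposed API):

from datetime import datetime, timezone

async def store_page(manager: GraphitiEpisodeManager, url: str, markdown: str):
    # reference_time should be a timezone-aware datetime.
    await manager.add_crawled_content(
        markdown,
        {"url": url, "timestamp": datetime.now(timezone.utc)},
    )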

Docker Compose Services

services:
  ii-agent:
    build: .
    environment:
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/iiagent
      - NEO4J_URI=bolt://neo4j:7687
      - NEO4J_USER=neo4j
      - NEO4J_PASSWORD=graphiti_password
      - REDIS_URL=redis://redis:6379
      - CRAWL4AI_MEMORY_THRESHOLD=0.8
    depends_on: [postgres, redis, neo4j, crawl4ai]
    
  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: iiagent
      POSTGRES_USER: postgres 
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data
      
  redis:
    image: redis:7-alpine
    
  neo4j:
    image: neo4j:5.23
    environment:
      - NEO4J_AUTH=neo4j/graphiti_password
      - NEO4J_PLUGINS=["apoc"]
    ports:
      - "7474:7474"
      - "7687:7687"
    volumes:
      - neo4j_data:/data
      - neo4j_logs:/logs
    
  # Crawl4AI as a scalable internal service; workers reach it at
  # http://crawl4ai:8000. No host port is published, since multiple
  # replicas cannot all bind the same host port.
  crawl4ai:
    image: unclecode/crawl4ai:latest
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN}
    deploy:
      replicas: 2
      
  crawler-worker:
    build: 
      context: .
      dockerfile: Dockerfile.crawler
    environment:
      - CRAWL_QUEUE_URL=redis://redis:6379
      - CRAWL4AI_URL=http://crawl4ai:8000
      - NEO4J_URI=bolt://neo4j:7687
      - NEO4J_USER=neo4j
      - NEO4J_PASSWORD=graphiti_password
    depends_on: [redis, neo4j, crawl4ai]
    deploy:
      replicas: 3

volumes:
  postgres_data:
  neo4j_data:
  neo4j_logs:

Environment Configuration

.env.example

# Core
ANTHROPIC_API_KEY=your_anthropic_key
DATABASE_URL=postgresql://postgres:password@postgres:5432/iiagent

# Neo4j & Graphiti Configuration
NEO4J_URI=bolt://neo4j:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=graphiti_password

# Crawl4AI Configuration  
CRAWL4AI_API_TOKEN=your_crawl4ai_token
CRAWL4AI_MEMORY_THRESHOLD=0.8
CRAWL4AI_MAX_CONCURRENT=10

# Social Media APIs
REDDIT_CLIENT_ID=your_reddit_client_id
REDDIT_CLIENT_SECRET=your_reddit_secret
TWITTER_BEARER_TOKEN=your_twitter_token
LINKEDIN_CLIENT_ID=your_linkedin_id
LINKEDIN_CLIENT_SECRET=your_linkedin_secret

# Optional
TAVILY_API_KEY=your_tavily_key
JINA_API_KEY=your_jina_key
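
For reference, a service can load these with nothing but the standard library (variable names match the file above; the fallback defaults are assumptions):

# Read configuration from the environment, failing fast on required secrets.
import os

NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://neo4j:7687")
NEO4J_USER = os.environ.get("NEO4J_USER", "neo4j")
NEO4J_PASSWORD = os.environ["NEO4J_PASSWORD"]  # KeyError if unset
CRAWL4AI_MEMORY_THRESHOLD = float(os.environ.get("CRAWL4AI_MEMORY_THRESHOLD", "0.8"))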

Benefits

  • Zero-setup deployment: Single command gets full stack running including Neo4j
  • Persistent knowledge graph: Agent builds and maintains relationships between entities
  • Advanced search capabilities: RRF, MMR, and cross-encoder reranking for better retrieval
  • Rich data sources: Web crawling + social media beyond just web search
  • Scalable architecture: Crawl4AI's memory-adaptive dispatcher + Redis queues
  • LLM-optimized content: Crawl4AI generates clean markdown perfect for RAG
  • Deep crawling capabilities: Explore entire websites with intelligent strategies
  • Graph-based memory: Graphiti automatically extracts entities and relationships
  • Developer friendly: Standard docker-compose, clear API structure
  • Production ready: Health checks, volume persistence, environment configuration

Use Cases Enabled

  • Personal knowledge assistant: "Deep crawl my blog and find connections between my ideas"
  • Research automation: "Monitor these 50 websites and track how topics evolve"
  • Social media analysis: "What topics am I engaging with most and how do they relate?"
  • Documentation assistant: "Crawl our docs and understand product relationships"
  • Content discovery: "Find unexpected connections between my saved articles and projects"
  • Competitive intelligence: "Track competitor content and identify relationship patterns"
  • Academic research: "Build knowledge graphs from paper citations and topics"
