MORE FEATURES NEEDED - Docker-compose deployment + Web crawling + Social media ingestion + Graphiti RAG integration #66

@Kakachia777

Description

Problem Statement

Currently, ii-agent requires significant manual setup and lacks persistent knowledge capabilities:

  • Complex deployment: Users must manually install dependencies, configure databases, and set up workspace directories
  • Limited data sources: Only web search is available; no crawling or social media ingestion
  • No knowledge persistence: The agent forgets everything between sessions; there is no RAG system
  • Poor developer experience: Multiple config files, manual API key management, scattered setup steps

Current user journey: Clone repo → Install Python deps → Configure workspace → Set up database → Configure API keys → Debug connection issues → Finally test agent

Desired user journey: Clone repo → docker compose up → Add API keys to .env → Agent ready with crawling + RAG

Proposed Solution

Add a comprehensive data ingestion and knowledge management system with containerized deployment.

Core Components

1. Docker Infrastructure

  • Full docker-compose stack (agent + UI + postgres + redis + neo4j + graphiti + crawler workers)
  • Environment-based configuration
  • Volume persistence for data
  • Health checks and dependency management
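
For the health-check and dependency wiring, something like the following would let dependent services wait until Postgres is genuinely ready (a minimal sketch; the pg_isready probe and intervals are placeholder values, not final ones):

services:
  postgres:
    image: postgres:15
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5

  ii-agent:
    depends_on:
      postgres:
        condition: service_healthy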

2. Web Crawling System with Crawl4AI

  • WebCrawlTool using Crawl4AI v0.5.0 for content extraction
  • Deep crawling capabilities with configurable strategies (BFS, DFS, Best-First)
  • Memory-Adaptive Dispatcher for large-scale crawls with dynamic concurrency
  • LLM-friendly markdown output, well suited for RAG pipeline ingestion
  • Redis-based job queue for async crawling
  • Automatic content deduplication and chunking
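
A minimal sketch of what the tool would wrap, using Crawl4AI's documented deep-crawl API (BFSDeepCrawlStrategy, FilterChain, and URLPatternFilter are Crawl4AI v0.5 names; the function and its defaults are assumptions for illustration):

# Deep-crawl a site breadth-first and return LLM-ready markdown per page.
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter

async def deep_crawl(url: str, max_depth: int = 3, max_pages: int = 50):
    strategy = BFSDeepCrawlStrategy(
        max_depth=max_depth,
        max_pages=max_pages,
        include_external=False,
        filter_chain=FilterChain([URLPatternFilter(patterns=["*/blog/*", "*/docs/*"])]),
    )
    config = CrawlerRunConfig(deep_crawl_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        # With a deep-crawl strategy, arun() returns one result per page visited.
        results = await crawler.arun(url, config=config)
    return [(r.url, str(r.markdown)) for r in results]

if __name__ == "__main__":
    pages = asyncio.run(deep_crawl("https://example.com"))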

3. Social Media Connectors

  • Reddit: PRAW-based fetcher for user comments/posts (Reddit listings cap at roughly 1000 items; see the sketch below)
  • Twitter/X: Tweepy v2 integration with rate limit handling
  • LinkedIn: REST API integration for public posts/analytics
  • TikTok: Research API support (graceful 403 handling for non-academic tokens)
  • Instagram: Graph API integration
  • Other platforms as follow-ups...
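
As a sketch of the Reddit connector (standard PRAW calls; the generator shape and field names are assumptions about reddit_fetcher.py, not its final interface):

# Stream recent submissions from one or more subreddits via PRAW.
import os
import praw

def fetch_reddit(subreddits: list[str], count: int = 500):
    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent="ii-agent social ingestion",
    )
    # Reddit listings cap out around 1000 items regardless of `limit`.
    for submission in reddit.subreddit("+".join(subreddits)).new(limit=count):
        yield {
            "id": submission.id,
            "title": submission.title,
            "text": submission.selftext,
            "created_utc": submission.created_utc,
            "url": f"https://reddit.com{submission.permalink}",
        }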

4. Graphiti Knowledge Graph Integration

  • Neo4j-backed knowledge graph with semantic search and relationship extraction
  • Episode-based memory: Stores conversation turns as episodes with automatic fact extraction
  • Advanced search capabilities:
    • Reciprocal Rank Fusion (RRF) combining BM25 and semantic search
    • Maximal Marginal Relevance (MMR) for diverse results
    • Cross-encoder reranking with OpenAI or BGE models
  • RAGQueryTool for agent to retrieve relevant context with graph relationships
  • Session-scoped knowledge with cross-session learning
  • Entity relationship mapping between people, topics, and content sources
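
For intuition, Reciprocal Rank Fusion scores each document by summing 1/(k + rank) over the BM25 and semantic rankings (k is conventionally 60). Graphiti performs this fusion internally, so the standalone version below is purely illustrative:

# RRF: documents ranked well by several retrievers float to the top.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: "a" leads both lists and wins; "b" beats "d" on combined rank.
fused = rrf([["a", "b", "c"], ["a", "d", "b"]])  # -> ['a', 'b', 'd', 'c']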

Technical Implementation

File Structure

├── docker-compose.yml
├── Dockerfile.crawler
├── .env.example
├── src/
│   ├── ii_agent/tools/
│   │   ├── web_crawl_tool.py
│   │   ├── social_ingestion_tool.py
│   │   └── rag_query_tool.py
│   ├── crawler/
│   │   ├── worker.py
│   │   ├── crawl4ai_config.py
│   │   └── processors/
│   ├── social/
│   │   ├── reddit_fetcher.py
│   │   ├── twitter_fetcher.py
│   │   └── linkedin_fetcher.py
│   └── graphiti/
│       ├── client.py
│       └── episode_manager.py

Key APIs

WebCrawlTool (using Crawl4AI deep crawling)

{
  "name": "web_crawl",
  "input": {
    "urls": ["https://example.com"],
    "strategy": "BFS",
    "max_depth": 3,
    "max_pages": 50,
    "url_filters": ["*/blog/*", "*/docs/*"],
    "crawler_mode": "browser"
  }
}
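
The tool handler itself can stay thin: validate the input, push a job onto the Redis queue, and let the crawler workers do the heavy lifting. A hedged sketch (the queue name and payload shape are assumptions):

# Enqueue a crawl job for the worker pool; returns a job id to poll on.
import json
import uuid
import redis.asyncio as redis

async def enqueue_crawl(tool_input: dict) -> str:
    r = redis.from_url("redis://redis:6379")
    job_id = str(uuid.uuid4())
    # Workers BRPOP from this list and forward the job to the crawl4ai service.
    await r.lpush("crawl_jobs", json.dumps({"job_id": job_id, **tool_input}))
    return job_id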

SocialIngestionTool

{
  "name": "social_ingest", 
  "input": {
    "platform": "reddit",
    "count": 500,
    "filters": {"subreddits": ["python", "MachineLearning"]}
  }
}

RAGQueryTool (with Graphiti graph search)

{
  "name": "rag_query",
  "input": {
    "query": "What topics do I comment about most?",
    "search_type": "hybrid",
    "rerank_method": "rrf",
    "include_relationships": true,
    "top_k": 10
  }
}
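
One plausible way to map rerank_method onto Graphiti: the recipe constants below exist in graphiti_core.search.search_config_recipes, while the dispatch table and the use of the lower-level _search() entry point are assumptions for illustration:

# Map the tool's rerank_method parameter to a Graphiti search recipe.
from graphiti_core.search.search_config_recipes import (
    COMBINED_HYBRID_SEARCH_CROSS_ENCODER,
    COMBINED_HYBRID_SEARCH_MMR,
    COMBINED_HYBRID_SEARCH_RRF,
)

RERANK_RECIPES = {
    "rrf": COMBINED_HYBRID_SEARCH_RRF,
    "mmr": COMBINED_HYBRID_SEARCH_MMR,
    "cross_encoder": COMBINED_HYBRID_SEARCH_CROSS_ENCODER,
}

async def rag_query(client, query: str, rerank_method: str = "rrf", top_k: int = 10):
    config = RERANK_RECIPES[rerank_method].model_copy(deep=True)
    config.limit = top_k
    # Graphiti's configurable search entry point takes (query, config).
    return await client._search(query, config)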

Graphiti Integration Details

Episode Management

# src/graphiti/episode_manager.py
from graphiti_core import Graphiti
from graphiti_core.nodes import EpisodeType
from graphiti_core.utils.bulk_utils import RawEpisode

class GraphitiEpisodeManager:
    def __init__(self, neo4j_uri, neo4j_user, neo4j_password):
        self.client = Graphiti(neo4j_uri, neo4j_user, neo4j_password)

    async def add_crawled_content(self, content, metadata):
        """Add one crawled page as an episode in the knowledge graph."""
        # graphiti-core takes episode fields as keyword arguments;
        # EpisodeType.text is the variant for raw document content.
        await self.client.add_episode(
            name=f"crawl_{metadata['url']}",
            episode_body=content,
            source=EpisodeType.text,
            source_description=metadata['url'],
            reference_time=metadata['timestamp'],
        )

    async def add_crawled_batch(self, pages):
        """Bulk-ingest (content, metadata) pairs via RawEpisode records."""
        episodes = [
            RawEpisode(
                name=f"crawl_{m['url']}",
                content=content,
                source=EpisodeType.text,
                source_description=m['url'],
                reference_time=m['timestamp'],
            )
            for content, m in pages
        ]
        await self.client.add_episode_bulk(episodes)

    async def search_knowledge(self, query, num_results=10):
        """Hybrid BM25 + semantic search over the graph."""
        # Graphiti's search() is already hybrid with RRF reranking; the MMR
        # and cross-encoder variants go through the recipe-based _search()
        # path shown earlier.
        return await self.client.search(query, num_results=num_results)
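
A hypothetical call site inside an async crawler worker (store_page and the metadata shape are illustrative, not part of the proposed API):

from datetime import datetime, timezone

async def store_page(manager: GraphitiEpisodeManager, url: str, markdown: str):
    # reference_time should be a timezone-aware datetime.
    await manager.add_crawled_content(
        markdown,
        {"url": url, "timestamp": datetime.now(timezone.utc)},
    )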

Docker Compose Services

services:
  ii-agent:
    build: .
    environment:
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/iiagent
      - NEO4J_URI=bolt://neo4j:7687
      - NEO4J_USER=neo4j
      - NEO4J_PASSWORD=graphiti_password
      - REDIS_URL=redis://redis:6379
      - CRAWL4AI_MEMORY_THRESHOLD=0.8
    depends_on: [postgres, redis, neo4j, crawl4ai]
    
  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: iiagent
      POSTGRES_USER: postgres 
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data
      
  redis:
    image: redis:7-alpine
    
  neo4j:
    image: neo4j:5.23
    environment:
      - NEO4J_AUTH=neo4j/graphiti_password
      - NEO4J_PLUGINS=["apoc"]
    ports:
      - "7474:7474"
      - "7687:7687"
    volumes:
      - neo4j_data:/data
      - neo4j_logs:/logs
    
  # Crawl4AI as a scalable internal service; workers reach it at
  # http://crawl4ai:8000. No host port is published, since multiple
  # replicas cannot all bind the same host port.
  crawl4ai:
    image: unclecode/crawl4ai:latest
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN}
    deploy:
      replicas: 2
      
  crawler-worker:
    build: 
      context: .
      dockerfile: Dockerfile.crawler
    environment:
      - CRAWL_QUEUE_URL=redis://redis:6379
      - CRAWL4AI_URL=http://crawl4ai:8000
      - NEO4J_URI=bolt://neo4j:7687
      - NEO4J_USER=neo4j
      - NEO4J_PASSWORD=graphiti_password
    depends_on: [redis, neo4j, crawl4ai]
    deploy:
      replicas: 3

volumes:
  postgres_data:
  neo4j_data:
  neo4j_logs:

Environment Configuration

.env.example

# Core
ANTHROPIC_API_KEY=your_anthropic_key
DATABASE_URL=postgresql://postgres:password@postgres:5432/iiagent

# Neo4j & Graphiti Configuration
NEO4J_URI=bolt://neo4j:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=graphiti_password

# Crawl4AI Configuration  
CRAWL4AI_API_TOKEN=your_crawl4ai_token
CRAWL4AI_MEMORY_THRESHOLD=0.8
CRAWL4AI_MAX_CONCURRENT=10

# Social Media APIs
REDDIT_CLIENT_ID=your_reddit_client_id
REDDIT_CLIENT_SECRET=your_reddit_secret
TWITTER_BEARER_TOKEN=your_twitter_token
LINKEDIN_CLIENT_ID=your_linkedin_id
LINKEDIN_CLIENT_SECRET=your_linkedin_secret

# Optional
TAVILY_API_KEY=your_tavily_key
JINA_API_KEY=your_jina_key
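
For reference, a service can load these with nothing but the standard library (variable names match the file above; the fallback defaults are assumptions):

# Read configuration from the environment, failing fast on required secrets.
import os

NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://neo4j:7687")
NEO4J_USER = os.environ.get("NEO4J_USER", "neo4j")
NEO4J_PASSWORD = os.environ["NEO4J_PASSWORD"]  # KeyError if unset
CRAWL4AI_MEMORY_THRESHOLD = float(os.environ.get("CRAWL4AI_MEMORY_THRESHOLD", "0.8"))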

Benefits

  • Zero-setup deployment: Single command gets full stack running including Neo4j
  • Persistent knowledge graph: Agent builds and maintains relationships between entities
  • Advanced search capabilities: RRF, MMR, and cross-encoder reranking for better retrieval
  • Rich data sources: Web crawling + social media beyond just web search
  • Scalable architecture: Crawl4AI's memory-adaptive dispatcher + Redis queues
  • LLM-optimized content: Crawl4AI generates clean markdown perfect for RAG
  • Deep crawling capabilities: Explore entire websites with intelligent strategies
  • Graph-based memory: Graphiti automatically extracts entities and relationships
  • Developer friendly: Standard docker-compose, clear API structure
  • Production ready: Health checks, volume persistence, environment configuration

Use Cases Enabled

  • Personal knowledge assistant: "Deep crawl my blog and find connections between my ideas"
  • Research automation: "Monitor these 50 websites and track how topics evolve"
  • Social media analysis: "What topics am I engaging with most and how do they relate?"
  • Documentation assistant: "Crawl our docs and understand product relationships"
  • Content discovery: "Find unexpected connections between my saved articles and projects"
  • Competitive intelligence: "Track competitor content and identify relationship patterns"
  • Academic research: "Build knowledge graphs from paper citations and topics"
