Implement Persistent Report Storage With Vector Search And Role-Based Access: A Comprehensive Guide

by James Vasile

Hey guys! Today, we're diving deep into a crucial upgrade for our TMS reports: implementing persistent storage with vector search and role-based access. This is going to revolutionize how we handle reports, making them more accessible, searchable, and secure. Let’s break it down!

Problem Statement: Why We Need Persistent Report Storage

Currently, our TMS reports (TMP, QO2, Team Signals) are fetched on-demand from external APIs. While this works, it’s not ideal. Think of it like ordering a pizza every time you're hungry instead of having some leftovers in the fridge. Here are the key limitations we’re facing:

  1. No Historical Access: Reports are regenerated each time, so we lose valuable historical snapshots. Imagine trying to track progress over time but only having the most recent data—frustrating, right?
  2. External Dependencies: Our report images rely on external API availability. If those APIs go down, so do our reports. We need to reduce this dependency.
  3. No Contextual Search: Agents can't search within user reports for specific answers. This means digging through reports manually, which is time-consuming and inefficient.
  4. Limited Access Control: We lack role-based access to team reports for managers. This is a security concern and limits collaboration.
  5. Performance Issues: Re-fetching and re-processing reports each time someone accesses them is a drag on performance. We need to make this faster and smoother.

Proposed Solution: A Comprehensive Report Storage System

So, what’s the fix? We need a comprehensive report storage system that addresses these limitations head-on. Our solution will:

  • Persist report HTML and images in Supabase. This means storing our reports in a robust and scalable database.
  • Chunk report content for vector search. This will allow us to search within reports for specific information quickly and accurately.
  • Integrate with agents for contextual responses. Our agents will be able to pull relevant information from reports to provide better assistance.
  • Enforce role-based access control. Managers will have access to team reports, ensuring data security and proper access levels.
  • Provide offline access to reports. This means users can view reports even without an internet connection.

Technical Architecture: How It All Works

Let's get into the nitty-gritty of how this system will be structured. We'll be building a robust architecture with several key components:

1. Database Schema: The Blueprint of Our Data

We'll start by defining our database schema. Think of this as the blueprint for how our data will be stored and organized. Here’s a breakdown of the tables we'll need:

  • user_reports Table: This table stores the complete reports. Each report will have a unique ID, user ID, team ID, organization ID, report type, subscription ID, template ID, raw HTML, processed HTML, metadata (like scores and insights), processing status, and timestamps for creation and updates.

    -- Store complete reports
    CREATE TABLE user_reports (
      id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
      user_id TEXT NOT NULL,
      team_id TEXT,
      organization_id TEXT NOT NULL,
      report_type TEXT NOT NULL CHECK (report_type IN ('TMP', 'QO2', 'TeamSignals', 'TeamSignals360')),
      subscription_id TEXT NOT NULL,
      template_id TEXT NOT NULL,
      raw_html TEXT NOT NULL,
      processed_html TEXT, -- HTML with local image URLs
      metadata JSONB, -- scores, profile, insights, etc.
      processing_status TEXT DEFAULT 'pending',
      processed_at TIMESTAMP WITH TIME ZONE,
      created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
      updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
    );
    

    This table is the central hub for all report data, ensuring we capture every detail from the raw HTML to processed versions and relevant metadata. The CHECK constraint on report_type ensures we only store valid report types, maintaining data integrity. Timestamps help us track when reports were created and last updated, crucial for auditing and historical analysis. The metadata column, using JSONB, allows us to store flexible data structures like scores, profiles, and insights, making the schema adaptable to future needs.

  • report_images Table: This table stores downloaded images associated with reports. It includes the image ID, report ID, original URL, storage path (where the image is stored in Supabase), image type (wheel, graph, asset), alt text, description, metadata (like chart parameters), and creation timestamp.

    -- Store downloaded images
    CREATE TABLE report_images (
      id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
      report_id UUID REFERENCES user_reports(id) ON DELETE CASCADE,
      original_url TEXT NOT NULL,
      storage_path TEXT NOT NULL, -- Supabase storage path
      image_type TEXT NOT NULL, -- wheel, graph, asset
      alt_text TEXT,
      description TEXT,
      metadata JSONB, -- chart parameters
      created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
    );
    

    By storing images separately, we optimize storage and retrieval, especially since images can be large. The original_url is essential for tracking the source, while storage_path points to the location in our Supabase storage. Including alt_text and description enhances accessibility and SEO, making the reports more user-friendly and discoverable. The image_type helps categorize images for different use cases, and the metadata column can store chart-specific parameters for dynamic rendering or analysis.

  • report_chunks Table: This is where we store chunked content for vector search. It includes the chunk ID, report ID, section ID, section title, content (the actual chunk of text), chunk index, character count, embedding (generated using OpenAI), metadata, and creation timestamp.

    -- Store chunked content for vector search
    CREATE TABLE report_chunks (
      id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
      report_id UUID REFERENCES user_reports(id) ON DELETE CASCADE,
      section_id TEXT NOT NULL,
      section_title TEXT NOT NULL,
      content TEXT NOT NULL,
      chunk_index INTEGER NOT NULL,
      char_count INTEGER NOT NULL,
      embedding vector(1536), -- OpenAI embeddings
      metadata JSONB,
      created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
    );
    

Chunking the report content allows us to perform efficient vector searches. Each chunk is associated with its report and section, making it easy to trace back to the original context. The embedding column stores the vector representation of the text, crucial for similarity searches. char_count provides additional context for ranking and relevance scoring, and metadata can include information like sentiment or topic analysis results. The chunk_index ensures the original order is maintained, essential for reconstructing the report’s structure if needed. (The extension and index setup that makes these similarity searches fast is sketched right after this list.)

  • report_access_log Table: This table tracks report access for auditing purposes. It includes the access log ID, report ID, user who accessed the report, access type (view, search, share), and access timestamp.

    -- Track report access for audit
    CREATE TABLE report_access_log (
      id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
      report_id UUID REFERENCES user_reports(id),
      accessed_by TEXT NOT NULL,
      access_type TEXT NOT NULL, -- view, search, share
      created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
    );
    

    Auditing is critical for security and compliance. This table helps us track who accessed which reports and how, providing a clear record of access patterns. This data can be used to identify potential security breaches, ensure compliance with data protection regulations, and understand user engagement with reports. By logging access_type, we can distinguish between views, searches, and shares, providing a granular view of report usage.
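
One gotcha worth flagging before we move on: the vector(1536) column won't work at all without the pgvector extension, and similarity queries over it will crawl without an approximate nearest-neighbour index. The updated_at column also only stays honest if a trigger refreshes it. Here's a minimal setup sketch, assuming Supabase's bundled pgvector (0.5+ for HNSW support):

-- Run before creating report_chunks, which uses the vector type
CREATE EXTENSION IF NOT EXISTS vector;

-- Approximate nearest-neighbour index for cosine similarity search
CREATE INDEX report_chunks_embedding_idx
  ON report_chunks USING hnsw (embedding vector_cosine_ops);

-- Keep user_reports.updated_at accurate on every update
CREATE OR REPLACE FUNCTION set_updated_at() RETURNS trigger AS $$
BEGIN
  NEW.updated_at = NOW();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER user_reports_set_updated_at
  BEFORE UPDATE ON user_reports
  FOR EACH ROW EXECUTE FUNCTION set_updated_at();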

2. Processing Pipeline: From Raw HTML to Searchable Chunks

Next up, we have the processing pipeline. This is the engine that takes raw HTML from the TMS API and transforms it into searchable chunks stored in our database. Here’s how it works:

class ReportProcessor {
  // Main processing flow
  async processReport(subscriptionId: string, userId: string): Promise<void> {
    // 1. Fetch HTML from TMS API
    const html = await fetchTMSReport(subscriptionId);

    // 2. Store raw report
    const report = await storeRawReport(userId, html);

    // 3. Queue async processing
    await queueReportProcessing(report.id);
  }

  // Async processing job
  // NB: getStoredReport and storeProcessedHtml are assumed helpers,
  // in the same spirit as fetchTMSReport and storeRawReport above
  async processReportJob(reportId: string): Promise<void> {
    // 1. Load the raw HTML stored by processReport
    const { raw_html: html } = await getStoredReport(reportId);

    // 2. Parse HTML and extract sections
    const sections = parseReportSections(html);

    // 3. Download and store images
    const imageMap = await downloadImages(html);

    // 4. Rewrite HTML to use local image URLs and persist it
    const processedHtml = replaceImageUrls(html, imageMap);
    await storeProcessedHtml(reportId, processedHtml);

    // 5. Extract and chunk text content
    const chunks = await chunkSections(sections);

    // 6. Generate embeddings
    const embeddings = await generateEmbeddings(chunks);

    // 7. Store in vector DB
    await storeChunksWithEmbeddings(chunks, embeddings);

    // 8. Update report status
    await updateReportStatus(reportId, 'completed');
  }
}

  • Main Processing Flow: The processReport method is the entry point. It fetches HTML from the TMS API, stores the raw report in the user_reports table, and queues an async job for the heavy processing.
  • Async Processing Job: The processReportJob method does the heavy lifting. It loads the stored raw HTML, parses it into sections, downloads and stores images, rewrites the HTML to point at local image URLs, chunks the text content, generates embeddings with OpenAI, stores the chunks with their embeddings in the report_chunks table, and finally marks the report 'completed'.
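
Since the embedding step is the piece most teams haven't wired up before, here's a minimal sketch of what generateEmbeddings might look like, assuming the official openai Node SDK and the text-embedding-3-small model, whose 1536-dimension output matches our vector(1536) column:

import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Embed all chunk contents in one batched API call
async function generateEmbeddings(
  chunks: { content: string }[]
): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunks.map((chunk) => chunk.content),
  });
  // Results come back in the same order as the inputs
  return response.data.map((item) => item.embedding);
}

Batching matters here: a typical report yields dozens of chunks, and one request is far cheaper in latency and rate-limit budget than dozens.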

3. Vector Search Integration: Finding Needles in Haystacks

Now, let’s talk about how we’ll search within these reports. Vector search allows us to find information based on meaning, not just keywords. This is a game-changer for contextual understanding. Here’s the interface for our ReportSearchService:

interface ReportSearchService {
  // Search within user's own reports
  // (note: interfaces declare method signatures only, so no `async`
  // keyword; the Promise return type is what callers await)
  searchUserReports(
    userId: string,
    query: string,
    reportTypes?: string[]
  ): Promise<SearchResult[]>;

  // Search within team reports (for managers)
  searchTeamReports(
    teamId: string,
    query: string,
    managerId: string
  ): Promise<SearchResult[]>;

  // Hybrid search combining TMS knowledge + reports
  hybridSearch(
    query: string,
    context: UserContext
  ): Promise<HybridSearchResult>;
}

  • searchUserReports: Lets users search within their own reports, with an optional filter by report type. (A sketch of one possible implementation follows this list.)
  • searchTeamReports: Lets managers search within team reports. It takes a team ID, a query, and the manager's ID so access can be verified.
  • hybridSearch: Combines TMS knowledge with report content for a more comprehensive answer. It takes a query and the user's context as input.
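
To make this concrete, here's one possible shape for searchUserReports. It leans on the generateEmbeddings helper from the pipeline sketch above, plus a Postgres function we're calling match_report_chunks (hypothetical; it would join report_chunks to user_reports, filter by user and report type, and order by cosine distance):

import { createClient } from '@supabase/supabase-js';

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_ANON_KEY!
);

async function searchUserReports(
  userId: string,
  query: string,
  reportTypes?: string[]
): Promise<SearchResult[]> {
  // Embed the query with the same model used for the stored chunks
  const [queryEmbedding] = await generateEmbeddings([{ content: query }]);

  // Delegate the similarity search to Postgres via RPC
  const { data, error } = await supabase.rpc('match_report_chunks', {
    query_embedding: queryEmbedding,
    filter_user_id: userId,
    filter_report_types: reportTypes ?? null,
    match_count: 10,
  });

  if (error) throw error;
  return data as SearchResult[];
}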

4. Access Control: Keeping Things Secure

Security is paramount. We need to ensure that users can only access the reports they’re authorized to see. We’ll implement role-based access control using Supabase’s Row Level Security (RLS) policies. Here are the policies we’ll create:

-- Policies are ignored until row level security is switched on
ALTER TABLE user_reports ENABLE ROW LEVEL SECURITY;
ALTER TABLE report_chunks ENABLE ROW LEVEL SECURITY;

-- RLS Policy: Users see own reports
-- (auth.uid() returns a UUID; cast to match our TEXT user_id columns)
CREATE POLICY user_own_reports ON user_reports
  FOR ALL USING (auth.uid()::text = user_id);

-- RLS Policy: Managers see team reports
CREATE POLICY manager_team_reports ON user_reports
  FOR SELECT USING (
    EXISTS (
      SELECT 1 FROM team_members
      WHERE team_members.team_id = user_reports.team_id
      AND team_members.user_id = auth.uid()::text
      AND team_members.role = 'manager'
    )
  );

-- RLS Policy: Report chunks inherit report access
CREATE POLICY report_chunks_access ON report_chunks
  FOR SELECT USING (
    EXISTS (
      SELECT 1 FROM user_reports
      WHERE user_reports.id = report_chunks.report_id
      AND (
        user_reports.user_id = auth.uid()::text
        OR EXISTS (
          SELECT 1 FROM team_members
          WHERE team_members.team_id = user_reports.team_id
          AND team_members.user_id = auth.uid()::text
          AND team_members.role = 'manager'
        )
      )
    )
  );

  • user_own_reports: Users can only see their own reports. Note the auth.uid()::text cast: our user_id columns are TEXT while Supabase's auth.uid() returns a UUID, so the comparison errors without it. Also note the ALTER TABLE ... ENABLE ROW LEVEL SECURITY statements up top; policies have no effect until RLS is enabled.
  • manager_team_reports: Managers can read reports belonging to their teams, via a membership check against team_members.
  • report_chunks_access: Access to report chunks mirrors access to the parent report, so vector search can never surface content a user couldn't open directly.
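
Auditing rounds out the security story. The report_access_log table only earns its keep if every access path writes to it, so it's worth centralizing that in one helper. A sketch, reusing the supabase client from the search example (the helper name is ours):

// Record an access event; call from every view/search/share code path
async function logReportAccess(
  reportId: string,
  accessedBy: string,
  accessType: 'view' | 'search' | 'share'
): Promise<void> {
  const { error } = await supabase.from('report_access_log').insert({
    report_id: reportId,
    accessed_by: accessedBy,
    access_type: accessType,
  });
  if (error) throw error;
}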

5. Agent Configuration Enhancement: Making Agents Smarter

Finally, we’ll enhance our agent configuration to leverage the new report storage system. This will allow our agents to provide more informed and contextual responses. Here’s how we’ll enhance the agent config:

interface EnhancedAgentConfig {
  // Existing config...
  knowledgeConfig: {
    sources: ('tms_ip' | 'user_reports')[];
    reportAccess: {
      enabled: boolean;
      level: 'own' | 'team' | 'organization';
      types: ('TMP' | 'QO2' | 'TeamSignals')[];
      includeHistorical: boolean;
      maxAge?: number; // days
    };
  };
}

  • We'll add user_reports as a knowledge source for our agents.
  • We'll introduce a reportAccess configuration that specifies the level of report access (own, team, or organization), the report types to include (TMP, QO2, TeamSignals), whether to include historical reports, and an optional maximum report age.
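
For instance, a debrief agent scoped to a user's own recent reports might be configured like this (the values are purely illustrative):

const debriefAgentConfig: EnhancedAgentConfig = {
  // ...existing config fields...
  knowledgeConfig: {
    sources: ['tms_ip', 'user_reports'],
    reportAccess: {
      enabled: true,
      level: 'own',            // no team or organization-wide access
      types: ['TMP', 'QO2'],   // Team Signals isn't relevant to debriefs
      includeHistorical: true, // allow older snapshots...
      maxAge: 365,             // ...but nothing over a year old
    },
  },
};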

Implementation Plan: Our Roadmap to Success

To make this happen, we’ll follow a phased implementation plan. This will allow us to roll out the new system incrementally, ensuring that each component is working correctly before moving on to the next. Here’s our roadmap:

Phase 1: Basic Storage (Week 1-2)

  • [ ] Create database schema
  • [ ] Implement report fetching service
  • [ ] Build image download pipeline
  • [ ] Create basic storage APIs
  • [ ] Add report viewer component

Phase 2: Processing Pipeline (Week 3-4)

  • [ ] Implement HTML parser for report sections
  • [ ] Build chunking algorithm
  • [ ] Create embedding generation service
  • [ ] Implement async job processing
  • [ ] Add processing status UI

Phase 3: Vector Search (Week 5-6)

  • [ ] Integrate with Supabase Vector
  • [ ] Build search APIs
  • [ ] Implement similarity search
  • [ ] Add search UI components
  • [ ] Create search result ranking

Phase 4: Agent Integration (Week 7-8)

  • [ ] Enhance agent configuration
  • [ ] Implement context injection
  • [ ] Build hybrid search
  • [ ] Add report citations
  • [ ] Test with debrief agent

Phase 5: Access Control (Week 9-10)

  • [ ] Implement RLS policies
  • [ ] Add team management
  • [ ] Build access audit logs
  • [ ] Create permission UI
  • [ ] Add security tests

Success Criteria: How We’ll Measure Our Progress

To ensure we’re on the right track, we’ll define clear success criteria across functional, performance, and security requirements. Here’s what we’re aiming for:

1. Functional Requirements

  • Reports stored within 2 minutes of generation
  • All images downloaded and accessible offline
  • Search returns relevant chunks within 500ms
  • Agents cite specific report sections

2. Performance Requirements

  • Report processing < 30s for typical report
  • Vector search < 200ms p95
  • Image loading < 100ms from storage

3. Security Requirements

  • Users only access own reports
  • Managers access team reports
  • All access is audited
  • No data leakage between teams

Testing Approach: Ensuring Quality and Reliability

Testing is crucial to ensure our new system works as expected. We’ll use a multi-faceted testing approach, including unit tests, integration tests, performance tests, and security tests. Here’s a breakdown:

1. Unit Tests

  • HTML parsing accuracy
  • Chunking algorithm (see the sketch after this list)
  • Access control logic
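
To give a flavor of the unit tests, here's a sketch for the chunking algorithm, assuming Vitest, a hypothetical ./report-processor module, and a 1,000-character chunk budget (the budget is our choice, not a requirement):

import { describe, it, expect } from 'vitest';
import { chunkSections } from './report-processor'; // hypothetical module path

describe('chunkSections', () => {
  it('respects the size budget and preserves section order', async () => {
    const sections = [
      { id: 'overview', title: 'Overview', content: 'x'.repeat(3500) },
    ];
    const chunks = await chunkSections(sections);

    // Every chunk stays within the character budget
    expect(chunks.every((c) => c.char_count <= 1000)).toBe(true);
    // chunk_index reconstructs the original ordering
    expect(chunks.map((c) => c.chunk_index)).toEqual(chunks.map((_, i) => i));
  });
});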

2. Integration Tests

  • End-to-end report processing
  • Vector search accuracy
  • Agent context integration

3. Performance Tests

  • Load testing with 1000+ reports
  • Concurrent processing
  • Search response times

4. Security Tests

  • Access control validation
  • SQL injection prevention
  • Cross-team data isolation

Dependencies: What We Need to Make This Work

To implement this system, we’ll rely on several key dependencies:

  • Supabase Vector extension
  • OpenAI API for embeddings
  • Background job processing (Supabase Edge Functions)
  • Additional storage for images

Risks and Mitigations: Planning for the Unexpected

No project is without risks. We need to anticipate potential challenges and have mitigation strategies in place. Here are some risks we’ve identified and how we plan to address them:

1. Storage Costs

  • Mitigate with image compression and retention policies

2. Processing Time

  • Use queue system and show progress indicators

3. Search Quality

  • Implement feedback loop for search improvement

4. Data Privacy

  • Encrypt sensitive report data at rest

Future Enhancements: What’s Next?

This is just the beginning! Once we have the core system in place, there are many exciting enhancements we can explore. Here are a few ideas:

  • Report comparison over time
  • Team performance aggregations
  • Export to various formats
  • Automated insights generation
  • Report sharing with external parties

Conclusion: A Smarter Way to Handle Reports

Implementing persistent report storage with vector search and role-based access is a significant step forward for our team. It will make our reports more accessible, searchable, and secure. By following our implementation plan and addressing potential risks, we’re confident that we can deliver a system that meets our needs and sets us up for future success. Let's get to work, guys!