Implement Persistent Report Storage With Vector Search and Role-Based Access: A Comprehensive Guide
Hey guys! Today, we're diving deep into a crucial upgrade for our TMS reports: implementing persistent storage with vector search and role-based access. This is going to revolutionize how we handle reports, making them more accessible, searchable, and secure. Let’s break it down!
Problem Statement: Why We Need Persistent Report Storage
Currently, our TMS reports (TMP, QO2, Team Signals) are fetched on-demand from external APIs. While this works, it’s not ideal. Think of it like ordering a pizza every time you're hungry instead of having some leftovers in the fridge. Here are the key limitations we’re facing:
- No Historical Access: Reports are regenerated each time, so we lose valuable historical snapshots. Imagine trying to track progress over time but only having the most recent data—frustrating, right?
- External Dependencies: Our report images rely on external API availability. If those APIs go down, so do our reports. We need to reduce this dependency.
- No Contextual Search: Agents can't search within user reports for specific answers. This means digging through reports manually, which is time-consuming and inefficient.
- Limited Access Control: We lack role-based access to team reports for managers. This is a security concern and limits collaboration.
- Performance Issues: Re-fetching and re-processing reports each time someone accesses them is a drag on performance. We need to make this faster and smoother.
Proposed Solution: A Comprehensive Report Storage System
So, what’s the fix? We need a comprehensive report storage system that addresses these limitations head-on. Our solution will:
- Persist report HTML and images in Supabase. This means storing our reports in a robust and scalable database.
- Chunk report content for vector search. This will allow us to search within reports for specific information quickly and accurately.
- Integrate with agents for contextual responses. Our agents will be able to pull relevant information from reports to provide better assistance.
- Enforce role-based access control. Managers will have access to team reports, ensuring data security and proper access levels.
- Provide offline access to reports. This means users can view reports even without an internet connection.
Technical Architecture: How It All Works
Let's get into the nitty-gritty of how this system will be structured. We'll be building a robust architecture with several key components:
1. Database Schema: The Blueprint of Our Data
We'll start by defining our database schema. Think of this as the blueprint for how our data will be stored and organized. Here’s a breakdown of the tables we'll need:
- `user_reports` Table: This table stores the complete reports. Each report will have a unique ID, user ID, team ID, organization ID, report type, subscription ID, template ID, raw HTML, processed HTML, metadata (like scores and insights), processing status, and timestamps for creation and updates.
```sql
-- Store complete reports
CREATE TABLE user_reports (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id TEXT NOT NULL,
  team_id TEXT,
  organization_id TEXT NOT NULL,
  report_type TEXT NOT NULL CHECK (report_type IN ('TMP', 'QO2', 'TeamSignals', 'TeamSignals360')),
  subscription_id TEXT NOT NULL,
  template_id TEXT NOT NULL,
  raw_html TEXT NOT NULL,
  processed_html TEXT,  -- HTML with local image URLs
  metadata JSONB,       -- scores, profile, insights, etc.
  processing_status TEXT DEFAULT 'pending',
  processed_at TIMESTAMP WITH TIME ZONE,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
  updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
```
This table is the central hub for all report data, capturing everything from the raw HTML to processed versions and relevant metadata. The `CHECK` constraint on `report_type` ensures we only store valid report types, maintaining data integrity. Timestamps help us track when reports were created and last updated, crucial for auditing and historical analysis. The `metadata` column, using `JSONB`, allows us to store flexible data structures like scores, profiles, and insights, making the schema adaptable to future needs.
- `report_images` Table: This table stores downloaded images associated with reports. It includes the image ID, report ID, original URL, storage path (where the image is stored in Supabase), image type (wheel, graph, asset), alt text, description, metadata (like chart parameters), and creation timestamp.
```sql
-- Store downloaded images
CREATE TABLE report_images (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  report_id UUID REFERENCES user_reports(id) ON DELETE CASCADE,
  original_url TEXT NOT NULL,
  storage_path TEXT NOT NULL,  -- Supabase storage path
  image_type TEXT NOT NULL,    -- wheel, graph, asset
  alt_text TEXT,
  description TEXT,
  metadata JSONB,              -- chart parameters
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
```
By storing images separately, we optimize storage and retrieval, especially since images can be large. The `original_url` is essential for tracking the source, while `storage_path` points to the location in our Supabase storage. Including `alt_text` and `description` enhances accessibility and discoverability, making the reports more user-friendly. The `image_type` helps categorize images for different use cases, and the `metadata` column can store chart-specific parameters for dynamic rendering or analysis.
- `report_chunks` Table: This is where we store chunked content for vector search. It includes the chunk ID, report ID, section ID, section title, content (the actual chunk of text), chunk index, character count, embedding (generated using OpenAI), metadata, and creation timestamp.
```sql
-- Store chunked content for vector search
CREATE TABLE report_chunks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  report_id UUID REFERENCES user_reports(id) ON DELETE CASCADE,
  section_id TEXT NOT NULL,
  section_title TEXT NOT NULL,
  content TEXT NOT NULL,
  chunk_index INTEGER NOT NULL,
  char_count INTEGER NOT NULL,
  embedding vector(1536),  -- OpenAI embeddings
  metadata JSONB,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
```
Chunking the report content allows us to perform efficient vector searches. Each chunk is associated with its report and section, making it easy to trace back to the original context. The `embedding` column stores the vector representation of the text, crucial for similarity searches. `char_count` provides additional context for ranking and relevance scoring, and `metadata` can include information like sentiment or topic analysis results. The `chunk_index` ensures the original order is maintained, essential for reconstructing the report's structure if needed.
- `report_access_log` Table: This table tracks report access for auditing purposes. It includes the access log ID, report ID, the user who accessed the report, the access type (view, search, share), and the access timestamp.
```sql
-- Track report access for audit
CREATE TABLE report_access_log (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  report_id UUID REFERENCES user_reports(id),
  accessed_by TEXT NOT NULL,
  access_type TEXT NOT NULL,  -- view, search, share
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
```
Auditing is critical for security and compliance. This table helps us track who accessed which reports and how, providing a clear record of access patterns. This data can be used to identify potential security breaches, ensure compliance with data protection regulations, and understand user engagement with reports. By logging `access_type`, we can distinguish between views, searches, and shares, providing a granular view of report usage.
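Since the schema above leaves `access_type` unconstrained at the database level, one option is to validate it in the application layer before inserting a row. A minimal sketch — the `buildAccessLogEntry` helper and its shapes are illustrative, not part of the schema:

```typescript
// The access_type column has no CHECK constraint, so validate it here
// before the row is inserted into report_access_log.
const ACCESS_TYPES = ['view', 'search', 'share'] as const;
type AccessType = (typeof ACCESS_TYPES)[number];

interface AccessLogEntry {
  report_id: string;
  accessed_by: string;
  access_type: AccessType;
}

function buildAccessLogEntry(
  reportId: string,
  userId: string,
  accessType: string
): AccessLogEntry {
  if (!reportId || !userId) {
    throw new Error('report_id and accessed_by are required');
  }
  if (!ACCESS_TYPES.includes(accessType as AccessType)) {
    throw new Error(`invalid access_type: ${accessType}`);
  }
  return {
    report_id: reportId,
    accessed_by: userId,
    access_type: accessType as AccessType,
  };
}
```

The returned object maps one-to-one onto the table's columns, so it can be passed straight to an insert call.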
2. Processing Pipeline: From Raw HTML to Searchable Chunks
Next up, we have the processing pipeline. This is the engine that takes raw HTML from the TMS API and transforms it into searchable chunks stored in our database. Here’s how it works:
```typescript
// Implemented as a class: TypeScript interfaces cannot contain method bodies.
class ReportProcessor {
  // Main processing flow
  async processReport(subscriptionId: string, userId: string): Promise<void> {
    // 1. Fetch HTML from TMS API
    const html = await fetchTMSReport(subscriptionId);
    // 2. Store raw report
    const report = await storeRawReport(userId, html);
    // 3. Queue async processing
    await queueReportProcessing(report.id);
  }

  // Async processing job
  async processReportJob(reportId: string): Promise<void> {
    // 0. Load the stored raw HTML (getStoredReport is an assumed helper)
    const { raw_html: html } = await getStoredReport(reportId);
    // 1. Parse HTML and extract sections
    const sections = parseReportSections(html);
    // 2. Download and store images
    const imageMap = await downloadImages(html);
    // 3. Update HTML with local URLs and persist it (assumed helper)
    const processedHtml = replaceImageUrls(html, imageMap);
    await saveProcessedHtml(reportId, processedHtml);
    // 4. Extract and chunk text content
    const chunks = await chunkSections(sections);
    // 5. Generate embeddings
    const embeddings = await generateEmbeddings(chunks);
    // 6. Store in vector DB
    await storeChunksWithEmbeddings(chunks, embeddings);
    // 7. Update report status
    await updateReportStatus(reportId, 'completed');
  }
}
```
- Main Processing Flow: The `processReport` function is the entry point. It fetches HTML from the TMS API, stores the raw report in the `user_reports` table, and queues an async job for further processing.
- Async Processing Job: The `processReportJob` function handles the heavy lifting. It parses the HTML, extracts sections, downloads and stores images, rewrites the HTML to use local image URLs, extracts and chunks text content, generates embeddings using OpenAI, stores chunks with embeddings in the `report_chunks` table, and updates the report status to 'completed'.
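To make the chunking step concrete, here's a minimal sketch of what `chunkSections` could look like: splitting each section on sentence boundaries into chunks of roughly `maxChars` characters. The `ReportSection` and `Chunk` shapes are assumptions that mirror the `report_chunks` columns:

```typescript
// Illustrative chunking sketch. Real report HTML would first be parsed into
// ReportSection objects; the splitting heuristic here is an assumption.
interface ReportSection {
  sectionId: string;
  title: string;
  text: string;
}

interface Chunk {
  sectionId: string;
  sectionTitle: string;
  content: string;
  chunkIndex: number;
  charCount: number;
}

function chunkSections(sections: ReportSection[], maxChars = 1000): Chunk[] {
  const chunks: Chunk[] = [];
  for (const section of sections) {
    // Split on sentence endings so chunks stay semantically coherent.
    const sentences = section.text.match(/[^.!?]+[.!?]*\s*/g) ?? [section.text];
    let buffer = '';
    let index = 0;
    const flush = () => {
      const content = buffer.trim();
      if (content.length === 0) return;
      chunks.push({
        sectionId: section.sectionId,
        sectionTitle: section.title,
        content,
        chunkIndex: index++,
        charCount: content.length,
      });
      buffer = '';
    };
    for (const sentence of sentences) {
      // Start a new chunk when adding this sentence would exceed the budget.
      if (buffer.length + sentence.length > maxChars) flush();
      buffer += sentence;
    }
    flush();
  }
  return chunks;
}
```

Keeping chunks aligned to sentence boundaries (rather than hard character cuts) tends to produce more useful embeddings, since each chunk is a coherent span of text.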
3. Vector Search Integration: Finding Needles in Haystacks
Now, let’s talk about how we’ll search within these reports. Vector search allows us to find information based on meaning, not just keywords. This is a game-changer for contextual understanding. Here’s the interface for our ReportSearchService:
```typescript
// Note: interface method signatures can't carry the `async` keyword;
// the Promise return types already express the asynchrony.
interface ReportSearchService {
  // Search within user's own reports
  searchUserReports(
    userId: string,
    query: string,
    reportTypes?: string[]
  ): Promise<SearchResult[]>;

  // Search within team reports (for managers)
  searchTeamReports(
    teamId: string,
    query: string,
    managerId: string
  ): Promise<SearchResult[]>;

  // Hybrid search combining TMS knowledge + reports
  hybridSearch(
    query: string,
    context: UserContext
  ): Promise<HybridSearchResult>;
}
```
- `searchUserReports`: This function allows users to search within their own reports. You can specify a query and optionally filter by report types.
- `searchTeamReports`: This function allows managers to search within team reports. It requires a team ID, query, and manager ID for access control.
- `hybridSearch`: This function combines TMS knowledge with report content for a more comprehensive search. It takes a query and user context as input.
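In production the similarity math runs inside Postgres via pgvector's distance operators, but the ranking idea is easy to sketch client-side — useful for re-ranking a small candidate set or for unit tests. The `SearchResult` shape here is an illustrative simplification:

```typescript
// Client-side re-ranking sketch using cosine similarity. This mirrors the
// math pgvector performs in the database; the shapes are assumptions.
interface SearchResult {
  chunkId: string;
  content: string;
  embedding: number[];
  score?: number;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Guard against zero vectors to avoid dividing by zero.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}

function rankResults(
  queryEmbedding: number[],
  results: SearchResult[]
): SearchResult[] {
  return results
    .map(r => ({ ...r, score: cosineSimilarity(queryEmbedding, r.embedding) }))
    .sort((x, y) => (y.score ?? 0) - (x.score ?? 0));
}
```

The same scoring could later be blended with `char_count` or recency to implement the search-result ranking planned for Phase 3.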
4. Access Control: Keeping Things Secure
Security is paramount. We need to ensure that users can only access the reports they’re authorized to see. We’ll implement role-based access control using Supabase’s Row Level Security (RLS) policies. Here are the policies we’ll create:
```sql
-- RLS Policy: Users see own reports
-- (auth.uid() returns UUID; cast to text since user_id is TEXT)
CREATE POLICY user_own_reports ON user_reports
  FOR ALL USING (auth.uid()::text = user_id);

-- RLS Policy: Managers see team reports
CREATE POLICY manager_team_reports ON user_reports
  FOR SELECT USING (
    EXISTS (
      SELECT 1 FROM team_members
      WHERE team_members.team_id = user_reports.team_id
        AND team_members.user_id = auth.uid()::text
        AND team_members.role = 'manager'
    )
  );

-- RLS Policy: Report chunks inherit report access
CREATE POLICY report_chunks_access ON report_chunks
  FOR SELECT USING (
    EXISTS (
      SELECT 1 FROM user_reports
      WHERE user_reports.id = report_chunks.report_id
        AND (
          user_reports.user_id = auth.uid()::text
          OR EXISTS (
            SELECT 1 FROM team_members
            WHERE team_members.team_id = user_reports.team_id
              AND team_members.user_id = auth.uid()::text
              AND team_members.role = 'manager'
          )
        )
    )
  );
```
- user_own_reports: This policy ensures that users can only see their own reports.
- manager_team_reports: This policy allows managers to see reports associated with their teams.
- report_chunks_access: This policy ensures that access to report chunks is consistent with access to the parent reports.
Note that policies only take effect once Row Level Security is enabled on each table (e.g. `ALTER TABLE user_reports ENABLE ROW LEVEL SECURITY;`).
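It also helps to mirror the RLS rules in application code so the access logic can be unit-tested without a database — useful for the security tests planned in Phase 5. A sketch, with illustrative `Report` and `TeamMembership` shapes:

```typescript
// Application-side mirror of the RLS rules above, for unit testing.
// The Report and TeamMembership shapes are illustrative assumptions.
interface Report {
  id: string;
  userId: string;
  teamId: string | null;
}

interface TeamMembership {
  teamId: string;
  userId: string;
  role: 'member' | 'manager';
}

function canAccessReport(
  report: Report,
  requesterId: string,
  memberships: TeamMembership[]
): boolean {
  // Rule 1 (user_own_reports): owners always see their own reports.
  if (report.userId === requesterId) return true;
  // Rule 2 (manager_team_reports): managers see their teams' reports.
  return memberships.some(
    m =>
      m.userId === requesterId &&
      m.role === 'manager' &&
      m.teamId === report.teamId
  );
}
```

Keeping this predicate in sync with the SQL policies gives a fast feedback loop for catching access-control regressions before they reach the database layer.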
5. Agent Configuration Enhancement: Making Agents Smarter
Finally, we’ll enhance our agent configuration to leverage the new report storage system. This will allow our agents to provide more informed and contextual responses. Here’s how we’ll enhance the agent config:
```typescript
interface EnhancedAgentConfig {
  // Existing config...
  knowledgeConfig: {
    sources: ['tms_ip', 'user_reports'];
    reportAccess: {
      enabled: boolean;
      level: 'own' | 'team' | 'organization';
      types: ('TMP' | 'QO2' | 'TeamSignals')[];
      includeHistorical: boolean;
      maxAge?: number; // days
    };
  };
}
```
- We'll add `user_reports` as a knowledge source for our agents.
- We'll introduce a `reportAccess` configuration that allows us to specify the level of report access (own, team, or organization), the types of reports to include (TMP, QO2, TeamSignals), whether to include historical reports, and a maximum age for reports.
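To make this concrete, here's what a populated `reportAccess` config might look like, along with a small guard that applies the optional `maxAge` window. The `debriefAgentAccess` instance and `isReportEligible` helper are illustrative assumptions, not part of the proposed interface:

```typescript
// Illustrative instance of the reportAccess config block.
type ReportType = 'TMP' | 'QO2' | 'TeamSignals';

interface ReportAccessConfig {
  enabled: boolean;
  level: 'own' | 'team' | 'organization';
  types: ReportType[];
  includeHistorical: boolean;
  maxAge?: number; // days
}

// Hypothetical config for the debrief agent: own reports only,
// TMP and QO2 types, history capped at 90 days.
const debriefAgentAccess: ReportAccessConfig = {
  enabled: true,
  level: 'own',
  types: ['TMP', 'QO2'],
  includeHistorical: true,
  maxAge: 90,
};

// Assumed helper: decides whether a stored report falls inside the
// agent's configured type list and age window.
function isReportEligible(
  config: ReportAccessConfig,
  type: ReportType,
  createdAt: Date,
  now: Date = new Date()
): boolean {
  if (!config.enabled || !config.types.includes(type)) return false;
  if (config.maxAge === undefined) return true;
  const ageDays = (now.getTime() - createdAt.getTime()) / 86_400_000;
  return ageDays <= config.maxAge;
}
```

A filter like this would run before report chunks are injected into the agent's context, so stale or out-of-scope reports never reach the model.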
Implementation Plan: Our Roadmap to Success
To make this happen, we’ll follow a phased implementation plan. This will allow us to roll out the new system incrementally, ensuring that each component is working correctly before moving on to the next. Here’s our roadmap:
Phase 1: Basic Storage (Week 1-2)
- [ ] Create database schema
- [ ] Implement report fetching service
- [ ] Build image download pipeline
- [ ] Create basic storage APIs
- [ ] Add report viewer component
Phase 2: Processing Pipeline (Week 3-4)
- [ ] Implement HTML parser for report sections
- [ ] Build chunking algorithm
- [ ] Create embedding generation service
- [ ] Implement async job processing
- [ ] Add processing status UI
Phase 3: Vector Search (Week 5-6)
- [ ] Integrate with Supabase Vector
- [ ] Build search APIs
- [ ] Implement similarity search
- [ ] Add search UI components
- [ ] Create search result ranking
Phase 4: Agent Integration (Week 7-8)
- [ ] Enhance agent configuration
- [ ] Implement context injection
- [ ] Build hybrid search
- [ ] Add report citations
- [ ] Test with debrief agent
Phase 5: Access Control (Week 9-10)
- [ ] Implement RLS policies
- [ ] Add team management
- [ ] Build access audit logs
- [ ] Create permission UI
- [ ] Add security tests
Success Criteria: How We’ll Measure Our Progress
To ensure we’re on the right track, we’ll define clear success criteria across functional, performance, and security requirements. Here’s what we’re aiming for:
1. Functional Requirements
- Reports stored within 2 minutes of generation
- All images downloaded and accessible offline
- Search returns relevant chunks within 500ms
- Agents cite specific report sections
2. Performance Requirements
- Report processing < 30s for typical report
- Vector search < 200ms p95
- Image loading < 100ms from storage
3. Security Requirements
- Users only access own reports
- Managers access team reports
- All access is audited
- No data leakage between teams
Testing Approach: Ensuring Quality and Reliability
Testing is crucial to ensure our new system works as expected. We’ll use a multi-faceted testing approach, including unit tests, integration tests, performance tests, and security tests. Here’s a breakdown:
1. Unit Tests
- HTML parsing accuracy
- Chunking algorithm
- Access control logic
2. Integration Tests
- End-to-end report processing
- Vector search accuracy
- Agent context integration
3. Performance Tests
- Load testing with 1000+ reports
- Concurrent processing
- Search response times
4. Security Tests
- Access control validation
- SQL injection prevention
- Cross-team data isolation
Dependencies: What We Need to Make This Work
To implement this system, we’ll rely on several key dependencies:
- Supabase Vector extension
- OpenAI API for embeddings
- Background job processing (Supabase Edge Functions)
- Additional storage for images
Risks and Mitigations: Planning for the Unexpected
No project is without risks. We need to anticipate potential challenges and have mitigation strategies in place. Here are some risks we’ve identified and how we plan to address them:
1. Storage Costs
- Mitigate with image compression and retention policies
2. Processing Time
- Use queue system and show progress indicators
3. Search Quality
- Implement feedback loop for search improvement
4. Data Privacy
- Encrypt sensitive report data at rest
Future Enhancements: What’s Next?
This is just the beginning! Once we have the core system in place, there are many exciting enhancements we can explore. Here are a few ideas:
- Report comparison over time
- Team performance aggregations
- Export to various formats
- Automated insights generation
- Report sharing with external parties
Conclusion: A Smarter Way to Handle Reports
Implementing persistent report storage with vector search and role-based access is a significant step forward for our team. It will make our reports more accessible, searchable, and secure. By following our implementation plan and addressing potential risks, we’re confident that we can deliver a system that meets our needs and sets us up for future success. Let's get to work, guys!