Definition
Research Infrastructure for AI is the technical foundation of data and systems, built and organized so that AI agents can take part effectively in scientific research.
The Current Reality: Labs Are Not Ready
Typical Lab Data Setup ❌
Ask a researcher "Show me your database" → what you get:
├─ Folder of Excel files
│ ├─ cells merged randomly
│ ├─ inconsistent column names
│ ├─ special characters in filenames (#, $, %, etc.)
│ ├─ dates in 10 different formats
│ └─ handwritten notes in margins
├─ PDF papers with embedded images
│ └─ Data is hard for AI to extract reliably
├─ PowerPoint presentations with figures
│ └─ Data not in machine-readable format
└─ Scattered lab notebooks (physical & digital)
   └─ Inconsistent recording standards
Why This Fails for AI ❌
AI reads these formats and sees:
├─ "What is this merged cell? Data? Header?"
├─ "Column 'exp_data' vs 'ExperimentData' vs 'ED' — are they the same?"
├─ "SpecialChar$#%.txt — valid filename?"
├─ "Date format: MM/DD/YY or DD/MM/YY?"
└─ → Complete incomprehension
Human (reading same data):
├─ Knows from context what things mean
├─ Understands implied structure
├─ Can infer missing information
└─ → Natural comprehension
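
The date-format question above is not hypothetical. As a minimal sketch (the helper name `possible_dates` is an illustration, not part of any existing tool), the same string can parse cleanly under both conventions, so a program has no way to pick the right one without outside context:

```python
from datetime import datetime

def possible_dates(raw: str) -> set[str]:
    """Return every ISO 8601 date that `raw` could mean under the two formats."""
    candidates = set()
    for fmt in ("%m/%d/%y", "%d/%m/%y"):  # MM/DD/YY and DD/MM/YY
        try:
            candidates.add(datetime.strptime(raw, fmt).date().isoformat())
        except ValueError:
            pass  # raw is not a valid date under this format
    return candidates

# Two valid readings: 4 March 2025 or 3 April 2025.
ambiguous = possible_dates("03/04/25")
# Only one reading survives, because there is no month 25.
unambiguous = possible_dates("25/12/24")
```

A human resolves the first case from context (lab location, neighboring entries); a parser can only report that two answers exist.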
The Metaphor: Learning a New Language
Imagine your brilliant new colleague speaks **Slovak only**.
To work with this colleague:
├─ You must learn Slovak (or hire translator)
├─ Your team must learn Slovak
├─ All meetings must be in Slovak
├─ Documents must be in Slovak
└─ Massive infrastructure overhaul required
Similarly, to work with AI:
├─ Your lab must speak "AI-readable language"
├─ All data must be in machine-readable format
├─ Databases must be AI-comprehensible
├─ Huge investment in standardization
└─ → This is what [[wiki/concepts/Research-Infrastructure-for-AI]] means
What AI-Ready Infrastructure Looks Like
1. Structured Data Format ✅
Instead of: Excel file with merged cells
Use:
├─ CSV/JSON with strict schema
├─ Clear column definitions
├─ Consistent data types (INT, FLOAT, DATE, STRING)
├─ NO merged cells, special formatting
└─ Machine-parseable (AI can read every field without guessing)
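
One way to read "strict schema" is that every column has a declared type and a file is rejected as soon as a value does not fit. The sketch below uses only the Python standard library; the column names in `SCHEMA` are hypothetical and stand in for whatever a lab actually measures.

```python
import csv
import io
from datetime import date

# Hypothetical schema: column name -> parser that raises on bad input.
SCHEMA = {
    "sample_id": str,
    "temperature_c": float,
    "replicate": int,
    "measured_on": date.fromisoformat,  # expects YYYY-MM-DD
}

def load_rows(text: str) -> list[dict]:
    """Parse CSV text, enforcing exact column names and per-column types."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != list(SCHEMA):
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = []
    for line_no, raw in enumerate(reader, start=2):  # line 1 is the header
        try:
            rows.append({col: parse(raw[col]) for col, parse in SCHEMA.items()})
        except (ValueError, TypeError) as exc:
            raise ValueError(f"line {line_no}: {exc}") from exc
    return rows

GOOD = "sample_id,temperature_c,replicate,measured_on\nS-001,21.5,1,2025-05-04\n"
rows = load_rows(GOOD)
```

Failing loudly on the first bad value is a design choice: it pushes cleanup back to the person who knows what the value should have been, rather than letting a silent guess reach the analysis.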
2. Standardized Naming Conventions ✅
Instead of: "exp_data.xlsx", "ExperimentData.csv", "ED#2025.txt"
Use:
├─ Consistent naming: "experiment_data_2025-05-04.csv"
├─ Clear semantic meaning
├─ Parseable date formats (ISO 8601: YYYY-MM-DD)
├─ No special characters (only: letters, numbers, _, -)
└─ Machine + Human readable
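
The naming rules above can be checked mechanically. This is a sketch under one assumed convention, `<lowercase_name>_<YYYY-MM-DD>.<ext>`; the pattern, the allowed extensions, and the function name `parse_filename` are illustrative choices, not an established standard.

```python
import re
from datetime import date

# Assumed convention: lowercase words joined by "_", then an ISO 8601 date.
FILENAME_RE = re.compile(
    r"^(?P<name>[a-z0-9]+(?:_[a-z0-9]+)*)"
    r"_(?P<date>\d{4}-\d{2}-\d{2})"
    r"\.(?P<ext>csv|json)$"
)

def parse_filename(filename: str):
    """Return (name, date, extension), or None if the name breaks the convention."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    try:
        day = date.fromisoformat(match["date"])  # rejects e.g. month 13
    except ValueError:
        return None
    return match["name"], day, match["ext"]

ok = parse_filename("experiment_data_2025-05-04.csv")
bad = parse_filename("ED#2025.txt")  # special character, no full date
```

Because the date is parsed and not just pattern-matched, a name such as "experiment_data_2025-13-40.csv" is rejected too.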
3. Centralized Database ✅
Instead of: Scattered Excel files across shared drive
Use:
├─ Centralized Database (PostgreSQL, MongoDB, etc.)
├─ Single source of truth
├─ Access control & audit logs
├─ Backup & recovery
├─ API for programmatic access (AI can query directly)
└─ Real-time data synchronization
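
"AI can query directly" means a program can ask the database a question and get typed values back, with no spreadsheet in between. The sketch below uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for a real server such as PostgreSQL; the table layout and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a central database server
conn.execute(
    """CREATE TABLE measurement (
           id INTEGER PRIMARY KEY,
           experiment TEXT NOT NULL,
           quantity TEXT NOT NULL,
           value REAL NOT NULL,
           unit TEXT NOT NULL,
           measured_at TEXT NOT NULL  -- ISO 8601 timestamp
       )"""
)
# Illustrative rows only, not real data.
conn.executemany(
    "INSERT INTO measurement (experiment, quantity, value, unit, measured_at) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("exp-001", "temperature", 20.0, "degC", "2025-05-04T09:00:00"),
        ("exp-001", "temperature", 22.0, "degC", "2025-05-04T10:00:00"),
    ],
)

def mean_value(conn, experiment, quantity):
    """Return (mean, unit) for one quantity in one experiment, or None."""
    return conn.execute(
        "SELECT AVG(value), unit FROM measurement "
        "WHERE experiment = ? AND quantity = ? GROUP BY unit",
        (experiment, quantity),
    ).fetchone()

result = mean_value(conn, "exp-001", "temperature")
```

The `NOT NULL` constraints do part of the standardization work: a row without a unit or timestamp cannot be stored in the first place.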
4. Ontology-Based Schema ✅
Beyond just tables:
├─ Define concepts: Experiment, Measurement, Parameter
├─ Define relationships: Experiment → Measurement → Parameter
├─ Enforce consistency: All measurements must have units, timestamps, source
├─ Enable reasoning: AI can infer implications
└─ [[wiki/concepts/Ontology]] structures make data AI-comprehensible
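
A lightweight way to prototype the concepts and rules above, before committing to a full ontology language, is to model them as typed classes. This is a sketch only: the field names are assumptions, and the arrows in "Experiment → Measurement → Parameter" are interpreted here as "contains".

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass(frozen=True)
class Parameter:
    name: str
    value: float
    unit: str

@dataclass(frozen=True)
class Measurement:
    quantity: str
    value: float
    unit: str
    timestamp: datetime
    source: str
    parameters: tuple[Parameter, ...] = ()

    def __post_init__(self):
        # Enforce the rule: every measurement has a unit, a timestamp, a source.
        if not self.unit.strip():
            raise ValueError("measurement needs a unit")
        if not isinstance(self.timestamp, datetime):
            raise ValueError("measurement needs a timestamp")
        if not self.source.strip():
            raise ValueError("measurement needs a source")

@dataclass
class Experiment:
    experiment_id: str
    measurements: list[Measurement] = field(default_factory=list)

# Illustrative values only.
reading = Measurement(
    "temperature", 21.5, "degC", datetime(2025, 5, 4, 9, 30), "thermocouple-A",
    (Parameter("stir_rate", 300.0, "rpm"),),
)
experiment = Experiment("exp-001", [reading])
```

Constructing `Measurement("temperature", 21.5, "", ...)` raises `ValueError`, so an incomplete record is caught at entry time instead of during analysis.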
5. Machine-Readable Literature ✅
Instead of: PDF papers, PowerPoint presentations
Use:
├─ Semantic markup (RDF, JSON-LD)
├─ Structured metadata
├─ Linked references (each citation machine-readable)
├─ Extracted figures as data (not images)
└─ AI can analyze 10,000 papers in minutes
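
To make "semantic markup" concrete, here is a sketch of a paper described as JSON-LD. It assumes the schema.org vocabulary (`ScholarlyArticle`, `author`, `datePublished`, `citation`), which is one common choice and not something this note prescribes. The title, author, and URL are placeholders, not real publications.

```python
import json

def article_jsonld(title, authors, published, cites):
    """Build a JSON-LD description of a paper using schema.org terms."""
    return {
        "@context": "https://schema.org",
        "@type": "ScholarlyArticle",
        "name": title,
        "author": [{"@type": "Person", "name": a} for a in authors],
        "datePublished": published,  # ISO 8601 date string
        # Each citation is an identifier a machine can follow, not free text.
        "citation": [{"@type": "ScholarlyArticle", "@id": c} for c in cites],
    }

# Placeholder values for illustration only.
doc = article_jsonld(
    title="Example Paper Title",
    authors=["A. Researcher"],
    published="2025-05-04",
    cites=["https://example.org/papers/1"],
)
serialized = json.dumps(doc, indent=2)
```

The point of the structure is that a citation becomes a link between two records, which is what lets software traverse a literature graph.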
Implementation Challenges
1. Cost & Effort
"This is a massive, expensive undertaking"
Reality:
├─ Database infrastructure: $100K - $1M+ (setup & maintenance)
├─ Staff training: months to years
├─ Data migration: substantial effort
├─ Ongoing standardization: continuous cost
└─ → Major investment required
2. Legacy Data
Existing research data:
├─ Decades of inconsistent formats
├─ Missing metadata
├─ Ambiguous formats
├─ Manual extraction & cleaning required
└─ → One-time massive effort to migrate
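
Part of that migration effort is reconciling names that mean the same thing, such as 'exp_data', 'ExperimentData', and 'ED'. A minimal sketch, assuming a hand-built alias table (`ALIASES` and `canonical_column` are hypothetical names): normalize case and punctuation first, then look the result up, and refuse to guess when a name is unknown.

```python
import re

# Hypothetical alias table, filled in by people who know the data.
ALIASES = {
    "expdata": "experiment_data",
    "experimentdata": "experiment_data",
    "ed": "experiment_data",
}

def canonical_column(raw: str) -> str:
    """Map a legacy column name to its canonical name, or raise KeyError."""
    key = re.sub(r"[^a-z0-9]", "", raw.lower())  # drop case, "_", spaces, etc.
    if key not in ALIASES:
        raise KeyError(f"unmapped column: {raw!r}")
    return ALIASES[key]

mapped = [canonical_column(c) for c in ("exp_data", "ExperimentData", "ED")]
```

Raising on unmapped names keeps the manual step visible: each new variant has to be reviewed once by a person and added to the table.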
3. Researcher Resistance
Challenges:
├─ Researchers want flexibility (AI needs rigidity)
├─ "Why do I need to follow this format?"
├─ Learning new systems takes time
├─ Feels like bureaucracy to scientists
└─ → Cultural change management required
Why It’s Worth It
Benefits Realization
Once the infrastructure is ready:
├─ AI processes data instantly (not days)
├─ No data re-entry or manual processing
├─ Consistent data quality across team
├─ Reproducible research
├─ [[Automated Scientist]] becomes possible
└─ → Return on investment enormous
Research Impact
Before:
├─ 1 Postdoc → 1 paper/2 years
└─ Limited by human capacity
After:
├─ 1 Postdoc + AI → 10 papers/year
├─ [[wiki/concepts/Research-Automation-Pipeline]] fully activated
├─ [[wiki/concepts/Human-AI-Research-Partnership]] realized
└─ Exponential discovery acceleration
Practical Roadmap
Phase 1: Assessment (Months 1-2)
├─ Audit current data landscape
├─ Identify bottlenecks
├─ Define AI-readiness requirements
└─ Cost-benefit analysis
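
The audit step can start with something very small. The sketch below takes a list of filenames, counts them by extension, and flags names that use characters outside letters, numbers, "_" and "-". The function name `audit` and the sample names are illustrative; in practice the list might come from walking a shared drive with `pathlib.Path.rglob("*")`.

```python
import re
from collections import Counter
from pathlib import PurePath

# Letters, numbers, "_" and "-" only, plus an optional extension.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)?$")

def audit(filenames):
    """Return (count of files per extension, list of names with unsafe characters)."""
    by_ext = Counter(PurePath(n).suffix.lower() or "(none)" for n in filenames)
    unsafe = [n for n in filenames if not SAFE_NAME.match(PurePath(n).name)]
    return by_ext, unsafe

# Illustrative names only.
by_ext, unsafe = audit([
    "experiment_data_2025-05-04.csv",
    "ED#2025.txt",
    "results final.xlsx",
    "Plate3.XLSX",
])
```

Even this rough count gives the cost-benefit analysis a number to work with: how many files exist in each format, and how many would need renaming.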
Phase 2: Pilot (Months 3-6)
├─ Select 1-2 research areas
├─ Build database + schema
├─ Migrate sample data
├─ Test with AI tools
└─ Iterate and improve
Phase 3: Full Implementation (Months 7-18)
├─ Scale infrastructure
├─ Migrate all legacy data
├─ Train entire team
├─ Establish governance
└─ Monitor and optimize
Phase 4: Integration (Month 19+)
├─ Deploy [[Automated Scientist]]
├─ Enable [[wiki/concepts/Human-AI-Research-Partnership]]
├─ Continuous improvement
└─ Competitive advantage established
Success Indicators
Infrastructure is "AI-ready" when:
├─ [ ] All experimental data in structured format
├─ [ ] Zero manual data entry for AI processing
├─ [ ] AI can query any dataset directly
├─ [ ] Researcher describes experiment once, system captures all metadata
├─ [ ] Literature seamlessly integrated with experimental data
├─ [ ] Reproducibility fully automated
└─ → [[Automated Scientist]] can operate independently
References
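- Automated Scientist: depends on this technical foundation
- Research-Automation-Pipeline: where the infrastructure is put to use
- Human-AI-Research-Partnership: impossible without the infrastructure
- Ontology: the core of data structuring
- ai-automated-scientist.md: emphasizes the importance of infrastructure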