[EXTRA] Music Search API

A complete Spring Boot application for semantic music search using OpenAI embeddings and Pinecone vector database.

✅ Implementation Status

All components from the original specification have been successfully implemented:

  • Maven Project Structure - Complete with all dependencies

  • Track Model - Comprehensive data model with all music attributes

  • Configuration Classes - OpenAI and Pinecone configurations

  • CSV Loader Service - Robust CSV parsing with error handling

  • Embedding Service - OpenAI integration with caching

  • Pinecone Service - Vector operations and similarity search

  • Index Controller - CSV indexing with progress tracking

  • Search Controller - Semantic search with suggestions

  • Application Properties - Complete configuration setup

  • Sample Data - Sample CSV with popular tracks

🚀 Quick Start

1. Prerequisites

  • Java 21

  • Maven 3.6+

  • OpenAI API Key

  • Pinecone Account and Index

2. Environment Variables

export OPENAI_API_KEY=sk-your-openai-api-key
export PINECONE_API_KEY=your-pinecone-api-key
export PINECONE_ENVIRONMENT=gcp-starter
export PINECONE_INDEX_NAME=music-tracks
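
As a minimal sketch of how these four variables could be read and validated at startup: the `EnvSettings` record below is a hypothetical helper, not the application's actual configuration class (which presumably binds them through Spring properties). It reports every missing variable at once instead of failing on the first one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical holder for the four settings; not the application's actual config class.
record EnvSettings(String openAiApiKey, String pineconeApiKey,
                   String pineconeEnvironment, String pineconeIndexName) {

    // Variable names taken from the export lines above.
    static final List<String> REQUIRED = List.of(
            "OPENAI_API_KEY", "PINECONE_API_KEY",
            "PINECONE_ENVIRONMENT", "PINECONE_INDEX_NAME");

    // Reads the settings from a map such as System.getenv() and
    // throws one exception listing all missing or blank variables.
    static EnvSettings from(Map<String, String> env) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.isBlank()) {
                missing.add(name);
            }
        }
        if (!missing.isEmpty()) {
            throw new IllegalStateException("Missing environment variables: " + missing);
        }
        return new EnvSettings(env.get("OPENAI_API_KEY"), env.get("PINECONE_API_KEY"),
                env.get("PINECONE_ENVIRONMENT"), env.get("PINECONE_INDEX_NAME"));
    }
}
```

Calling `EnvSettings.from(System.getenv())` early in startup would surface a missing key before the first OpenAI or Pinecone request is made.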

3. Build and Run

# Build the application
mvn clean package

# Run the application
mvn spring-boot:run

# Or run the JAR
java -jar target/music-search-0.0.1-SNAPSHOT.jar

The API will be available at http://localhost:8080

📡 API Endpoints

Index Management

Index CSV Data

POST /api/index?csvFileName=tracks_sample.csv

Response:

{
  "status": "success",
  "tracksLoaded": 10,
  "embeddingsGenerated": 10,
  "tracksIndexed": 10,
  "processingTimeMs": 15420,
  "message": "Successfully indexed 10 tracks in 15.42 seconds"
}

Get Index Status

GET /api/index/status

Clear Embedding Cache

DELETE /api/index/cache

Health Check

GET /api/index/health

Search Operations

POST /api/search
Content-Type: application/json

{
  "query": "energetic pop music for dancing",
  "topK": 5
}

Response:

{
  "status": "success",
  "query": "energetic pop music for dancing",
  "matches": [
    {
      "id": "6f807x0ima9a1j3VPbc7VN",
      "score": 0.91,
      "metadata": {
        "track_name": "I Don't Care (with Justin Bieber) - Loud Luxury Remix",
        "track_artist": "Ed Sheeran",
        "playlist_genre": "pop",
        "playlist_subgenre": "dance pop",
        "energy": 0.916,
        "tempo": 122.036
      }
    }
  ],
  "totalResults": 1,
  "searchTimeMs": 245
}

Search Suggestions

GET /api/search/suggestions?q=pop

Search Health Check

GET /api/search/health

🎵 Sample Search Queries

Try these example searches:

  1. "upbeat dance music" - Finds high-energy, danceable tracks

  2. "relaxing acoustic songs" - Finds calm, acoustic tracks

  3. "workout motivation" - Finds high-tempo, energetic tracks

  4. "romantic ballads" - Finds emotional, slow-tempo tracks

  5. "party anthems" - Finds upbeat, danceable party music

🏗️ Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   CSV Loader    │───▶│ Embedding       │───▶│   Pinecone      │
│                 │    │ Service         │    │   Service       │
│ - Parse CSV     │    │                 │    │                 │
│ - Validate data │    │ - OpenAI API    │    │ - Vector ops    │
│ - Error handling│    │ - Caching       │    │ - Similarity    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                    ┌─────────────────┐
                    │   Controllers   │
                    │                 │
                    │ - Index API     │
                    │ - Search API    │
                    └─────────────────┘

🔧 Configuration

OpenAI Settings

  • Model: text-embedding-3-small

  • Timeout: 60 seconds (configurable)

  • Caching: Enabled, with a 1000-entry limit
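
One way a bounded embedding cache could look is sketched below. The eviction policy is an assumption: the README only states the 1000-entry limit, so least-recently-used eviction is chosen here for illustration. `EmbeddingCache` and its methods are hypothetical names, and the `embedder` function stands in for the real OpenAI call.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical LRU cache for embeddings; the real service may evict differently.
class EmbeddingCache {
    private final Map<String, float[]> entries;

    EmbeddingCache(int maxEntries) {
        // accessOrder = true makes iteration order "least recently used first".
        this.entries = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, float[]> eldest) {
                // Drop the least recently used entry once the limit is exceeded.
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached vector, or computes and stores it on a miss.
    synchronized float[] get(String text, Function<String, float[]> embedder) {
        float[] cached = entries.get(text);
        if (cached == null) {
            cached = embedder.apply(text);
            entries.put(text, cached);
        }
        return cached;
    }

    synchronized int size() {
        return entries.size();
    }

    // Roughly what a cache-clearing endpoint would trigger.
    synchronized void clear() {
        entries.clear();
    }
}
```

With `new EmbeddingCache(1000)`, repeated queries for the same text skip the embedding call until the entry is evicted or the cache is cleared.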

Pinecone Settings

  • Environment: Configurable (gcp-starter, aws, etc.)

  • Index Name: Configurable

  • Vector Dimensions: 1536 (the default output size of text-embedding-3-small)

CSV Processing

  • Batch Size: 100 tracks per batch

  • Error Handling: Continues processing on individual errors

  • Field Mapping: Automatic based on header names
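
The three behaviours above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's CSV Loader Service: `CsvSketch` and `TrackRow` are hypothetical, the column names `track_name`, `track_artist`, and `energy` are assumed to match the metadata keys shown in the search response, and the naive comma split does not handle quoted fields, which a real CSV library would.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CsvSketch {
    // Hypothetical, reduced track type; the real Track model has more attributes.
    record TrackRow(String name, String artist, double energy) {}

    // Maps columns by header name and skips rows that fail to parse instead of aborting.
    static List<TrackRow> parse(List<String> lines) {
        List<TrackRow> rows = new ArrayList<>();
        if (lines.isEmpty()) {
            return rows;
        }
        String[] header = lines.get(0).split(",", -1);
        Map<String, Integer> column = new HashMap<>();
        for (int i = 0; i < header.length; i++) {
            column.put(header[i].trim(), i);
        }
        for (String line : lines.subList(1, lines.size())) {
            try {
                String[] cells = line.split(",", -1);
                rows.add(new TrackRow(
                        cells[column.get("track_name")].trim(),
                        cells[column.get("track_artist")].trim(),
                        Double.parseDouble(cells[column.get("energy")].trim())));
            } catch (RuntimeException e) {
                // Missing cell, unknown column, or bad number: report and continue.
                System.err.println("Skipping row: " + line);
            }
        }
        return rows;
    }

    // Splits items into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> batches(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            result.add(new ArrayList<>(items.subList(start, end)));
        }
        return result;
    }
}
```

Because columns are looked up by name, the header order in the file does not matter, and `CsvSketch.batches(rows, 100)` yields the 100-track groups described above.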

📊 Features

✅ Implemented Features

  • Semantic Search: Natural language music queries

  • Batch Processing: Efficient CSV indexing

  • Caching: Embedding result caching

  • Error Handling: Robust error recovery

  • Health Checks: Service monitoring endpoints

  • CORS Support: Cross-origin requests

  • Logging: Comprehensive logging

  • Configuration: Environment-based config

🔄 Data Flow

  1. Load CSV → Parse tracks from CSV file

  2. Generate Embeddings → Create vector representations

  3. Store in Pinecone → Index vectors with metadata

  4. Search Query → Generate query embedding

  5. Find Similar → Cosine similarity search

  6. Return Results → Formatted track information
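
To make step 5 concrete, here is a brute-force sketch of cosine similarity and top-K ranking. In the application Pinecone performs this server-side over the whole index; the `Cosine` class below is hypothetical and only illustrates what the score measures, assuming the index uses the cosine metric.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class Cosine {
    // Cosine similarity: dot(a, b) / (|a| * |b|); 0.0 if either vector is all zeros.
    static double similarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the indices of the k stored vectors most similar to the query, best first.
    static List<Integer> topK(float[] query, List<float[]> vectors, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < vectors.size(); i++) {
            indices.add(i);
        }
        indices.sort(Comparator
                .comparingDouble((Integer i) -> similarity(query, vectors.get(i)))
                .reversed());
        return new ArrayList<>(indices.subList(0, Math.min(k, indices.size())));
    }
}
```

A score near 1 means the query and track embeddings point in nearly the same direction, which is how a value such as the 0.91 in the sample response should be read.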

🧪 Testing

Manual Testing

  1. Start the application

  2. Index the sample data: POST /api/index?csvFileName=tracks_sample.csv

  3. Search for music: POST /api/search with various queries

  4. Check status: GET /api/index/status

Sample Queries to Test

# High energy dance music
curl -X POST http://localhost:8080/api/search \
  -H "Content-Type: application/json" \
  -d '{"query": "high energy dance music", "topK": 3}'

# Relaxing acoustic songs
curl -X POST http://localhost:8080/api/search \
  -H "Content-Type: application/json" \
  -d '{"query": "relaxing acoustic songs", "topK": 3}'

# Workout motivation
curl -X POST http://localhost:8080/api/search \
  -H "Content-Type: application/json" \
  -d '{"query": "workout motivation music", "topK": 3}'
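
The same request can also be issued from Java with the JDK's built-in `java.net.http.HttpClient`. The sketch below is a hypothetical client, not part of the project: `SearchClient` and its method names are illustrative, and the hand-rolled JSON escaping covers only quotes and backslashes, so a JSON library would be the safer choice for arbitrary input.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class SearchClient {
    // Builds the JSON body used by POST /api/search; escapes backslashes and quotes only.
    static String searchBody(String query, int topK) {
        String escaped = query.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"query\": \"" + escaped + "\", \"topK\": " + topK + "}";
    }

    // Creates the POST request without sending it.
    static HttpRequest searchRequest(String baseUrl, String query, int topK) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(searchBody(query, topK)))
                .build();
    }

    // Sends the request and returns the raw JSON response body.
    static String search(String baseUrl, String query, int topK)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                searchRequest(baseUrl, query, topK),
                HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

With the application running locally, `SearchClient.search("http://localhost:8080", "high energy dance music", 3)` is equivalent to the first curl command.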

📝 Notes

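  • The application uses OpenAI's text-embedding-3-small model, whose default 1536-dimensional output matches the Pinecone index dimension

  • Pinecone performs the vector similarity search; the application generates the embeddings and submits the queries

  • CSV indexing runs in batches of 100 tracks and caches embeddings, which keeps larger music catalogs manageable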

  • All API endpoints include comprehensive error handling

  • The application supports both development and production configurations

🚀 Next Steps

  1. Set up your API keys in environment variables

  2. Create your Pinecone index with 1536 dimensions and the cosine metric

  3. Add your music CSV file to the resources directory

  4. Test the API endpoints with sample queries

  5. Deploy to production with proper monitoring

The implementation is complete and ready for use! 🎉
