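Manufacturing MLOps Overview

1. What Is MLOps?

MLOps (Machine Learning Operations) is a methodology for systematizing the development, deployment, and operation of ML models. It applies DevOps principles to machine learning in order to automate the full model lifecycle.

MLOps matters especially in manufacturing. Data streams in from thousands of sensors, and AI models for tasks such as real-time quality inspection and predictive maintenance have to run 24/7 and be updated continuously as the process changes.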

Manufacturing MLOps Architecture

Manufacturing MLOps Platform

Data Layer: Sensor DB (TimescaleDB) | MES DB (PostgreSQL) | ERP Data (SAP) | Vision DB (MinIO)
▼
Feature Store (Feast): Online Store (Redis) | Offline Store (S3)
▼
Training Pipeline (Kubeflow) | Validation Pipeline (MLflow) | Inference Pipeline (Triton)
▼
Model Registry (MLflow): Staging → Production → Archived, plus Metadata
▼
Monitoring Layer (Prometheus + Grafana): Model Performance | Data Drift | System Metrics | Alerts

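Data Pipeline

An automated data flow from sensor data collection through to the feature store

Model Training

Reproducible training environments and hyperparameter tracking

Model Serving

High-performance serving infrastructure for real-time inference

Monitoring

Real-time detection of model performance degradation and data drift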
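
One common way to quantify data drift for a single sensor feature is the Population Stability Index (PSI), which compares the binned distribution of recent values against a reference sample. The sketch below is illustrative only: the function names, the 10-bin layout, and the 0.2 alert threshold (a widely used rule of thumb) are assumptions, not something this platform prescribes.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    num_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a reference sample and a current sample of one feature."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / num_bins

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * num_bins
        for v in values:
            # Values outside the reference range fall into the first or last bin.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), num_bins - 1)] += 1
        # Floor each fraction at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


def drift_detected(
    reference: Sequence[float],
    current: Sequence[float],
    threshold: float = 0.2,  # assumed rule-of-thumb alert level
) -> bool:
    return population_stability_index(reference, current) > threshold
```

In a setup like the one above, a check of this kind would typically run on a schedule and export its result as a metric for the monitoring layer to alert on.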

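2. Manufacturing MLOps Project Structure

A well-organized MLOps project needs a clear directory layout and explicit configuration files. The following is a typical structure for a manufacturing AI project.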

Project Structure
manufacturing-ai-project/
├── .github/
│   └── workflows/
│       ├── ci.yml              # CI pipeline
│       ├── cd.yml              # CD pipeline
│       └── model-training.yml  # automated model training
├── configs/
│   ├── training/
│   │   ├── base_config.yaml
│   │   └── experiment_001.yaml
│   ├── inference/
│   │   └── triton_config.pbtxt
│   └── monitoring/
│       └── alerts.yaml
├── data/
│   ├── raw/                    # raw data
│   ├── processed/              # preprocessed data
│   └── features/               # feature data
├── src/
│   ├── data/
│   │   ├── __init__.py
│   │   ├── loader.py           # data loader
│   │   ├── preprocessor.py     # preprocessing
│   │   └── feature_store.py    # feature store integration
│   ├── models/
│   │   ├── __init__.py
│   │   ├── base_model.py       # base model class
│   │   ├── defect_detector.py  # defect detection model
│   │   └── predictive_maint.py # predictive maintenance model
│   ├── training/
│   │   ├── __init__.py
│   │   ├── trainer.py          # training logic
│   │   └── callbacks.py        # callback functions
│   ├── inference/
│   │   ├── __init__.py
│   │   ├── predictor.py        # inference logic
│   │   └── batch_inference.py  # batch inference
│   └── utils/
│       ├── __init__.py
│       ├── logger.py           # logging
│       └── metrics.py          # metric computation
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_model_development.ipynb
│   └── 03_model_evaluation.ipynb
├── tests/
│   ├── unit/
│   │   └── test_preprocessor.py
│   └── integration/
│       └── test_pipeline.py
├── docker/
│   ├── Dockerfile.training
│   ├── Dockerfile.inference
│   └── docker-compose.yml
├── kubernetes/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── configmap.yaml
├── mlflow/
│   └── MLproject
├── pyproject.toml
├── requirements.txt
└── README.md

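MLOps Maturity Model

Level 0: manual training, manual deployment → Level 1: automated training pipeline → Level 2: CI/CD integration, automated retraining → Level 3: fully automated ML system

3. Configuration-Driven Training Management

To keep experiments reproducible, every training parameter is managed in a configuration file. Keeping these files in YAML makes experiments easy to track and to put under version control.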

configs/training/experiment_001.yaml
# Manufacturing AI Training Configuration
experiment:
  name: "defect_detection_v2"
  description: "YOLOv8-based surface defect detection model"
  tags:
    - production
    - defect_detection
    - yolov8

data:
  train_path: "s3://manufacturing-data/train/"
  val_path: "s3://manufacturing-data/val/"
  test_path: "s3://manufacturing-data/test/"
  image_size: [640, 640]
  batch_size: 32
  num_workers: 8
  augmentation:
    horizontal_flip: true
    vertical_flip: false
    rotation_range: 15
    brightness_range: [0.8, 1.2]

model:
  architecture: "yolov8m"
  pretrained: true
  num_classes: 5
  classes:
    - scratch
    - dent
    - crack
    - contamination
    - normal

training:
  epochs: 100
  optimizer:
    name: "AdamW"
    lr: 0.001
    weight_decay: 0.0005
  scheduler:
    name: "CosineAnnealingLR"
    T_max: 100
    eta_min: 0.00001
  early_stopping:
    patience: 10
    min_delta: 0.001
  checkpoint:
    save_best: true
    save_frequency: 5

infrastructure:
  device: "cuda"
  distributed: true
  num_gpus: 4
  mixed_precision: true

logging:
  mlflow_tracking_uri: "http://mlflow-server:5000"
  experiment_name: "defect_detection"
  log_frequency: 100
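
To make the `scheduler` and `early_stopping` settings above concrete, here is a minimal sketch of how they behave. `cosine_annealing_lr` and `EarlyStopping` are hypothetical names used for illustration, not part of the project. The schedule uses the standard closed form of cosine annealing over one cycle (0 ≤ epoch ≤ T_max), and the early-stopping rule assumes the monitored metric is a validation loss, where lower is better.

```python
import math


def cosine_annealing_lr(
    epoch: int,
    base_lr: float = 0.001,    # training.optimizer.lr
    eta_min: float = 0.00001,  # training.scheduler.eta_min
    t_max: int = 100,          # training.scheduler.T_max
) -> float:
    """Learning rate at a given epoch for one cosine-annealing cycle."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


class EarlyStopping:
    """Stop when the loss has not improved by min_delta for `patience` epochs."""

    def __init__(self, patience: int = 10, min_delta: float = 0.001) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = None
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if self.best is None or val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

With the values from the config, the learning rate starts at 0.001, passes 0.000505 at epoch 50, and reaches 0.00001 at epoch 100. Training ends earlier if the validation loss fails to improve by at least 0.001 for 10 consecutive epochs.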

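src/training/config_loader.py (sketch)
# =============================================
# Experiment config loader
# Loads a YAML config file and exposes the training parameters
# Requires PyYAML (pip install pyyaml)
# =============================================
from dataclasses import dataclass
from typing import Any

import yaml


# 1. Config structures (data, model, and training settings kept separate)
@dataclass
class DataConfig:
    train_path: str               # data paths
    val_path: str
    test_path: str
    image_size: list[int]         # data size settings
    batch_size: int
    num_workers: int
    augmentation: dict[str, Any]  # data augmentation options


@dataclass
class ModelConfig:
    architecture: str   # model architecture (e.g. "yolov8m")
    pretrained: bool
    num_classes: int    # number of classes
    classes: list[str]  # class names


@dataclass
class TrainingConfig:
    epochs: int
    optimizer: dict[str, Any]
    scheduler: dict[str, Any]
    early_stopping: dict[str, Any]  # early stopping settings
    checkpoint: dict[str, Any]      # checkpoint settings


@dataclass
class ExperimentConfig:
    experiment: dict[str, Any]  # name, description, tags
    data: DataConfig
    model: ModelConfig
    training: TrainingConfig


# 2. Load the full experiment config
def load_experiment_config(yaml_path: str) -> ExperimentConfig:
    # Read the YAML file
    with open(yaml_path, encoding="utf-8") as f:
        config_dict = yaml.safe_load(f)

    # Build one structure per section; unknown keys raise a TypeError
    return ExperimentConfig(
        experiment=config_dict["experiment"],
        data=DataConfig(**config_dict["data"]),
        model=ModelConfig(**config_dict["model"]),
        training=TrainingConfig(**config_dict["training"]),
    )


# 3. Validate the config
def validate_config(config: ExperimentConfig) -> None:
    if config.data.batch_size <= 0:
        raise ValueError("data.batch_size must be positive")
    if config.training.epochs <= 0:
        raise ValueError("training.epochs must be positive")
    if len(config.model.classes) != config.model.num_classes:
        raise ValueError("model.classes must contain num_classes entries")


# 4. Usage example
if __name__ == "__main__":
    config = load_experiment_config("configs/training/experiment_001.yaml")
    validate_config(config)
    print("Experiment:", config.experiment["name"])
    print("Model:", config.model.architecture)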
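
The project tree lists both `base_config.yaml` and `experiment_001.yaml`. The source does not say how the two are combined; one plausible arrangement is to treat the experiment file as a set of overrides layered on top of the base file. The sketch below shows that idea on plain dictionaries (as returned by a YAML parser). `merge_configs` is a hypothetical helper, not part of the project.

```python
import copy
from typing import Any, Dict


def merge_configs(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where values in `override` replace those in `base`.

    Nested dicts are merged key by key; any other value (including lists)
    is replaced wholesale. Neither input is modified.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Example: the experiment overrides only the learning rate and epoch count.
base = {"training": {"epochs": 100, "optimizer": {"name": "AdamW", "lr": 0.001}}}
experiment = {"training": {"epochs": 50, "optimizer": {"lr": 0.0005}}}
merged = merge_configs(base, experiment)
print(merged)
```

Replacing lists wholesale, rather than concatenating them, keeps fields such as `classes` unambiguous: an experiment that redefines the class list gets exactly the list it declares.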

4. Building the Data Pipeline

Manufacturing data comes from many sources, including sensors, MES, and ERP systems. Apache Airflow is used to automate data collection and preprocessing.

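src/data/pipeline.py (sketch)
# =============================================
# Manufacturing data pipeline
# ETL: extract sensor/quality data -> transform -> load into the feature store
# Assumptions: SQLAlchemy-compatible databases, a configured Feast repository,
# Airflow 2.4+ import paths, and the SENSOR_DB_URL / MES_DB_URL variables.
# =============================================
import os
from datetime import datetime, timedelta, timezone

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from feast import FeatureStore
from sqlalchemy import create_engine, text


# [Step 1] Extract
def extract_sensor_data(engine, start_date, end_date):
    # Query sensor readings from TimescaleDB (half-open window avoids
    # counting boundary rows twice across consecutive hourly runs)
    query = text(
        "SELECT timestamp, machine_id, sensor_type, sensor_value "
        "FROM sensor_readings "
        "WHERE timestamp >= :start AND timestamp < :end"
    )
    return pd.read_sql(query, engine, params={"start": start_date, "end": end_date})


def extract_quality_data(engine, start_date, end_date):
    # Query quality inspection records from the MES database
    query = text(
        "SELECT inspection_time, product_id, defect_type, severity "
        "FROM quality_inspections "
        "WHERE inspection_time >= :start AND inspection_time < :end"
    )
    return pd.read_sql(query, engine, params={"start": start_date, "end": end_date})


# [Step 2] Transform
def handle_missing_values(df):
    # Missing values: linear interpolation
    return df.interpolate(method="linear", limit_direction="both")


def handle_outliers(df):
    # Outliers: clip each column to its 1.5 * IQR fences
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    return df.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)


def transform_data(sensor_df, quality_df):
    # Align both sources on a shared one-hour key
    sensor_df = sensor_df.assign(hour=sensor_df["timestamp"].dt.floor("h"))
    quality_df = quality_df.assign(hour=quality_df["inspection_time"].dt.floor("h"))

    # Pivot sensor readings so each sensor_type becomes a column
    # (averaging per machine and hour is an assumption)
    sensor_pivot = sensor_df.pivot_table(
        index=["hour", "machine_id"],
        columns="sensor_type",
        values="sensor_value",
        aggfunc="mean",
    ).reset_index()
    sensor_cols = [c for c in sensor_pivot.columns if c not in ("hour", "machine_id")]

    # Join sensor and quality data, then clean the sensor columns
    merged = sensor_pivot.merge(quality_df, on="hour")
    merged[sensor_cols] = handle_missing_values(merged[sensor_cols])
    merged[sensor_cols] = handle_outliers(merged[sensor_cols])
    return merged


# [Step 3] Load
def load_to_feature_store(df, feature_view):
    # Write features to the Feast offline store; the DataFrame columns must
    # match the schema of the named feature view
    store = FeatureStore(repo_path=".")
    store.write_to_offline_store(feature_view, df)


# [Step 4] Run the pipeline (Airflow DAG)
def run_pipeline():
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=1)  # the past hour
    sensor_engine = create_engine(os.environ["SENSOR_DB_URL"])
    mes_engine = create_engine(os.environ["MES_DB_URL"])
    sensor_df = extract_sensor_data(sensor_engine, start, end)
    quality_df = extract_quality_data(mes_engine, start, end)
    transformed = transform_data(sensor_df, quality_df)
    load_to_feature_store(transformed, "manufacturing_features")


with DAG(
    dag_id="manufacturing_etl",       # hypothetical DAG name
    schedule="@hourly",               # run every hour
    start_date=datetime(2024, 1, 1),  # arbitrary placeholder
    catchup=False,
    default_args={"retries": 3},
) as dag:
    PythonOperator(task_id="run_pipeline", python_callable=run_pipeline)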

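Benefits of a Feature Store

A feature store (Feast, Tecton, and similar tools) enables feature reuse, consistency between training and inference, and feature versioning. In a manufacturing setting, hundreds of sensor features can be shared across multiple models, which cuts down on duplicated feature engineering.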
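
The training-inference consistency mentioned above comes from both paths reading the same stored values: training looks up each feature as it was at the label's timestamp, while inference reads the latest value. The toy in-memory class below illustrates that idea only. It is not Feast's API, and `ToyFeatureStore` and its method names are hypothetical.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class ToyFeatureStore:
    """In-memory stand-in: one write path, two read paths."""

    def __init__(self) -> None:
        # (entity_id, feature) -> list of (timestamp, value), kept sorted
        self._history: Dict[Tuple[str, str], List[Tuple[float, float]]] = {}

    def write(self, entity_id: str, feature: str, timestamp: float, value: float) -> None:
        rows = self._history.setdefault((entity_id, feature), [])
        bisect.insort(rows, (timestamp, value))

    def get_historical(self, entity_id: str, feature: str, as_of: float) -> Optional[float]:
        """Offline read: latest value with timestamp <= as_of (no future leakage)."""
        rows = self._history.get((entity_id, feature), [])
        idx = bisect.bisect_right(rows, (as_of, float("inf")))
        return rows[idx - 1][1] if idx else None

    def get_online(self, entity_id: str, feature: str) -> Optional[float]:
        """Online read: most recent value."""
        rows = self._history.get((entity_id, feature), [])
        return rows[-1][1] if rows else None


store = ToyFeatureStore()
store.write("machine-01", "temperature", timestamp=1, value=70.0)
store.write("machine-01", "temperature", timestamp=3, value=75.0)
```

A training row labeled at time 2 sees 70.0, the value known at that moment, while a live request sees 75.0. Both come from the same write, so the feature definition cannot diverge between the two paths.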

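5. CI/CD Pipeline

GitHub Actions is used to automate model training and deployment. Tests run automatically on every code change, and the model is deployed to production once its performance meets the required threshold.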

.github/workflows/model-training.yml
name: Model Training Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'configs/**'
  workflow_dispatch:
    inputs:
      experiment_name:
        description: 'Experiment name'
        required: true
        default: 'experiment_001'

env:
  AWS_REGION: ap-northeast-2
  ECR_REPOSITORY: manufacturing-ai
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
          cache: 'pip'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov

      - name: Run tests
        run: |
          pytest tests/ -v --cov=src --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          file: ./coverage.xml

  train:
    needs: test
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt

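      - name: Download training data
        run: |
          aws s3 sync s3://manufacturing-data/train/ data/train/
          aws s3 sync s3://manufacturing-data/val/ data/val/
          # The evaluation step below reads data/test/, so sync it as well
          aws s3 sync s3://manufacturing-data/test/ data/test/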

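      - name: Train model
        env:
          MLFLOW_TRACKING_URI: ${{ env.MLFLOW_TRACKING_URI }}
          EXPERIMENT_NAME: ${{ github.event.inputs.experiment_name || 'experiment_001' }}
        run: |
          python -m src.training.train \
            --config configs/training/${EXPERIMENT_NAME}.yaml \
            --mlflow-experiment manufacturing-ai
          # Assumption: the training script writes its MLflow run ID to run_id.txt.
          # Appending to GITHUB_ENV makes MLFLOW_RUN_ID visible to the later steps.
          echo "MLFLOW_RUN_ID=$(cat run_id.txt)" >> "$GITHUB_ENV"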

      - name: Evaluate model
        run: |
          python -m src.evaluation.evaluate \
            --model-uri runs:/${MLFLOW_RUN_ID}/model \
            --test-data data/test/

      - name: Register model (if performance threshold met)
        if: success()
        run: |
          python -m src.mlflow.register_model \
            --run-id ${MLFLOW_RUN_ID} \
            --model-name defect-detector \
            --min-accuracy 0.95

  deploy:
    needs: train
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build and push inference image
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -f docker/Dockerfile.inference \
            -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG

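      - name: Deploy to Kubernetes
        # Step-level env does not carry over from the previous step, so the
        # image coordinates are set again here. Assumes kubectl is already
        # configured for the target cluster on this runner.
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          kubectl set image deployment/inference-server \
            inference=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG \
            --namespace=production
          kubectl rollout status deployment/inference-server \
            --namespace=production --timeout=300s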

      - name: Run smoke tests
        run: |
          python -m tests.integration.smoke_test \
            --endpoint https://inference.company.com \
            --timeout 60

      - name: Notify on success
        if: success()
        uses: slackapi/slack-github-action@v1
        with:
          channel-id: 'C0XXXXXXX'
          slack-message: "Model deployed successfully! Version: ${{ github.sha }}"
        env:
          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
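
The workflow calls `src.mlflow.register_model` with `--min-accuracy 0.95`, but the script itself is not shown. The sketch below illustrates only the gating logic such a script might contain, under the assumption that evaluation produces a metrics dictionary with an `accuracy` key. The function names are hypothetical and no MLflow calls are made.

```python
from typing import Mapping, Optional


def meets_threshold(metrics: Mapping[str, float], min_accuracy: float = 0.95) -> bool:
    """True only when an accuracy value is present and reaches the threshold."""
    accuracy: Optional[float] = metrics.get("accuracy")
    return accuracy is not None and accuracy >= min_accuracy


def gate_exit_code(metrics: Mapping[str, float], min_accuracy: float = 0.95) -> int:
    """Process exit code for the CI step: 0 to continue, 1 to fail the step."""
    return 0 if meets_threshold(metrics, min_accuracy) else 1


# Example: a model at 96.2% accuracy passes the gate, one at 91% does not.
print(gate_exit_code({"accuracy": 0.962}))
print(gate_exit_code({"accuracy": 0.91}))
```

Returning a non-zero exit code fails the step and therefore the `train` job. Because `deploy` declares `needs: train`, it is then skipped, which keeps a model below the threshold from reaching production.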

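Manufacturing MLOps Best Practices

1) Put every experiment under version control 2) Version training data and models together 3) Roll out gradually with A/B testing 4) Define a rollback strategy 5) Automate model performance monitoring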
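
Gradual rollout and rollback (practices 3 and 4) can both be driven by a single traffic-split setting. The sketch below shows one simple way to do this, assuming each request carries a stable key such as a machine or product ID. `route_to_canary` is a hypothetical helper, and the hashing scheme is an illustrative choice rather than a prescribed one.

```python
import hashlib


def route_to_canary(request_key: str, canary_percent: float) -> bool:
    """Deterministically send about `canary_percent` % of keys to the new model.

    The same key always gets the same answer for a given percentage, so one
    machine or product is not bounced between model versions.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < canary_percent * 100


# Example: with a 10% split, roughly one in ten keys reaches the new model.
keys = [f"machine-{i}" for i in range(10_000)]
share = sum(route_to_canary(k, 10.0) for k in keys) / len(keys)
print(f"canary share: {share:.3f}")
```

Raising `canary_percent` step by step widens the rollout, and setting it back to 0 acts as an immediate rollback because no key maps to the new model any more.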