C++ std::mutex 중첩 Lock으로 인한 6초 지연 버그 분석

윈도우/기타 이슈분석 2025. 12. 21. 21:43

개요

Windows 서비스와 UI 클라이언트 간 IPC 통신에서 특정 명령어(learning_status)가 6초 이상 소요되는 문제가 발생했습니다. 원인은 std::mutex의 중첩 Lock 시도였습니다.

증상

[21:28:31.645] [InitialPollLearningStatus] semaphore acquired, calling IPC...
[21:28:37.770] [IPCService.SendJsonCommand] cmd=learning_status, took 6113ms, result=-4, error=Pipe read timeout

learning_status IPC 명령만 6초 이상 소요
다른 명령어(get_status, get_statistics)는 정상 (수 ms)
파이프 타임아웃으로 연결 끊김 → 후속 IPC 모두 실패

문제 코드

class LearningModeManager {
    mutable std::mutex m_mutex;
    LearningSession m_session;

public:
    LearningModeStatus GetStatus() const {
        std::lock_guard<std::mutex> lock(m_mutex);  // ← 1차 Lock

        LearningModeStatus status;
        status.isActive = m_session.active;
        // ...
        status.remainingSeconds = GetRemainingSeconds();  // ← 문제!
        // ...
        return status;
    }

    int GetRemainingSeconds() const {
        std::lock_guard<std::mutex> lock(m_mutex);  // ← 2차 Lock 시도!

        if (!m_session.active) return 0;
        uint64_t now = GetCurrentUnixTime();
        return static_cast<int>(m_session.endTime - now);
    }
};

원인 분석

std::mutex는 non-recursive

std::mutex는 같은 스레드에서 재진입(recursive lock)을 허용하지 않습니다.

Thread A: GetStatus() 호출
    ↓
Thread A: m_mutex.lock() 성공 ✓
    ↓
Thread A: GetRemainingSeconds() 호출
    ↓
Thread A: m_mutex.lock() 시도 → 💀 교착 상태

왜 6초 후에 풀렸나?

완전한 교착 상태(deadlock)가 아닌 이유:

void RecordAccess(const std::wstring& processPath, const std::wstring& filePath) {
    std::lock_guard<std::mutex> lock(m_mutex);  // 다른 스레드에서 호출
    // I/O 이벤트 기록
}

RecordAccess()는 I/O 이벤트마다 다른 스레드에서 호출
이 스레드가 mutex를 잡았다 놓았다 반복
우연히 GetRemainingSeconds()가 mutex를 획득하는 타이밍 발생
결과: 완전한 교착은 아니지만 비정상적 지연

해결 방법

방법 1: 내부 헬퍼 함수로 분리 (권장)

LearningModeStatus GetStatus() const {
    std::lock_guard<std::mutex> lock(m_mutex);

    LearningModeStatus status;
    status.isActive = m_session.active;
    // ...

    // Lock 상태에서 직접 계산 (함수 호출 대신 inline)
    if (!m_session.active) {
        status.remainingSeconds = 0;
    } else {
        uint64_t now = GetCurrentUnixTime();
        status.remainingSeconds = (now >= m_session.endTime) ? 0
            : static_cast<int>(m_session.endTime - now);
    }

    return status;
}

방법 2: std::recursive_mutex 사용

mutable std::recursive_mutex m_mutex;  // recursive 허용

⚠️ 단점: 설계 문제를 숨기고, 성능 오버헤드 발생

방법 3: Lock-free 내부 헬퍼

// Public API (with lock)
int GetRemainingSeconds() const {
    std::lock_guard<std::mutex> lock(m_mutex);
    return GetRemainingSecondsInternal();
}

// Internal (no lock) - 이미 lock 상태에서 호출
int GetRemainingSecondsInternal() const {
    if (!m_session.active) return 0;
    uint64_t now = GetCurrentUnixTime();
    return static_cast<int>(m_session.endTime - now);
}

디버깅 과정

1단계: 로그 추가

var sw = Stopwatch.StartNew();
var result = RTCtrlIPC.SendCommand(jsonRequest, responseBuffer);
sw.Stop();
DebugLogger.Log($"cmd={cmdName}, took {sw.ElapsedMilliseconds}ms");

2단계: 서버 측 타이밍 측정

auto startTime = std::chrono::steady_clock::now();
auto status = learningMgr.GetStatus();  // ← 여기서 6초
auto duration = std::chrono::steady_clock::now() - startTime;

3단계: 코드 리뷰로 원인 발견

GetStatus() 내부에서 GetRemainingSeconds() 호출 → 같은 mutex 재획득 시도

교훈

std::mutex는 non-recursive - 같은 스레드에서 중첩 lock 불가
Lock 보유 상태에서 다른 함수 호출 시 주의 - 해당 함수도 같은 mutex를 사용하는지 확인
타임아웃이 있는 시스템에서 지연은 연쇄 실패 유발 - 6초 지연 → 파이프 타임아웃 → 연결 끊김
"간헐적 지연" 버그는 mutex 문제일 가능성 - 완전한 교착이 아니라 조건부 지연으로 나타날 수 있음
"과도한 Wrapper 주의" - 코드를 리팩토링 한다는 목적으로 과도한 Warpper 객체를 사용함으로써 Lock 이 발생했을 때 Lock 상태가 걸려서 문제가 됐다는걸 판단하지 못했다. 적당한 래퍼가 좋은 것 같다.

결론

// Before: 중첩 Lock 시도 (버그)
status.remainingSeconds = GetRemainingSeconds();

// After: inline 계산 (수정)
status.remainingSeconds = (now >= m_session.endTime) ? 0
    : static_cast<int>(m_session.endTime - now);

결과: 6113ms → 수 ms로 개선

ABOUT ME

내일 더 성장하고 싶은 개발자 내일 더 성장하고 싶은 개발자

개요

증상