mirror of
https://github.com/ludo-technologies/pyscn.git
synced 2025-10-06 00:59:45 +03:00
fix: increase LSH similarity threshold from 0.78 to 0.88
## Problem

LSH (Locality-Sensitive Hashing) mode was producing inflated similarity scores compared to APTED mode, resulting in false positive clone detection.

## Root Cause

MinHash Jaccard similarity tends to be more lenient than APTED tree edit distance similarity due to:

- Coarse feature extraction (maxSubtreeHeight: 3, kGramSize: 4)
- Set-based similarity ignoring structural differences
- Common patterns (If, For, Assign) creating false overlaps

## Solution

Increased LSHSimilarityThreshold from 0.78 to 0.88 across all layers:

- internal/analyzer/clone_detector.go:216
- domain/clone.go:350
- internal/config/clone_config.go:200
- cmd/pyscn/init.go:84 (template)

The new threshold (0.88) is Type2Threshold (0.85) + 0.03, ensuring:

- More strict candidate filtering in LSH stage
- Reduced false positives
- Final APTED verification still provides accurate similarity

## Impact

- Fragments >= 500: LSH mode now produces comparable results to APTED
- Better precision without sacrificing recall
- Maintains performance benefits of LSH acceleration

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
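To make the "set-based similarity" point concrete, here is a minimal Go sketch of Jaccard similarity over feature sets. The feature labels and the `jaccard` and `toSet` helpers are hypothetical and are not taken from pyscn; the only values borrowed from the commit are the two thresholds. Two fragments that share eight of ten distinct features score 0.80, which passes the old threshold of 0.78 but not the new one of 0.88.

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| for two feature sets.
// Two empty sets are treated as identical (similarity 1).
func jaccard(a, b map[string]struct{}) float64 {
	if len(a) == 0 && len(b) == 0 {
		return 1.0
	}
	inter := 0
	for k := range a {
		if _, ok := b[k]; ok {
			inter++
		}
	}
	union := len(a) + len(b) - inter
	return float64(inter) / float64(union)
}

// toSet builds a set from a list of feature labels (illustrative helper).
func toSet(features ...string) map[string]struct{} {
	s := make(map[string]struct{}, len(features))
	for _, f := range features {
		s[f] = struct{}{}
	}
	return s
}

func main() {
	// Hypothetical feature labels; real features would be derived from
	// subtrees and k-grams, which are not reproduced here.
	a := toSet("If", "For", "Assign", "Call", "Return", "Compare", "Name", "BinOp", "Attribute")
	b := toSet("If", "For", "Assign", "Call", "Return", "Compare", "Name", "BinOp", "While")

	const oldThreshold, newThreshold = 0.78, 0.88 // values from this commit

	s := jaccard(a, b)
	fmt.Printf("jaccard=%.2f passesOld=%v passesNew=%v\n",
		s, s >= oldThreshold, s >= newThreshold)
}
```

Because the sets record only which features occur, not where or how often, fragments built from the same common node types can score high even when their tree structure differs, which is the effect the higher threshold is meant to dampen.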
--- a/cmd/pyscn/init.go
+++ b/cmd/pyscn/init.go
@@ -81,7 +81,7 @@ k_core_k = 2 # K value for k-core mode (minimum connections
 # LSH acceleration settings
 lsh_enabled = "auto" # LSH acceleration: true, false, auto (based on project size)
 lsh_auto_threshold = 500 # Enable LSH for 500+ fragments
-lsh_similarity_threshold = 0.78 # LSH similarity threshold
+lsh_similarity_threshold = 0.88 # LSH similarity threshold
 lsh_bands = 32 # Number of LSH bands
 lsh_rows = 4 # Rows per LSH band
 lsh_hashes = 128 # MinHash function count
--- a/domain/clone.go
+++ b/domain/clone.go
@@ -347,7 +347,7 @@ func DefaultCloneRequest() *CloneRequest {
 	// LSH defaults (auto-enable based on fragment count)
 	LSHEnabled: "auto",
 	LSHAutoThreshold: 500,
-	LSHSimilarityThreshold: 0.78,
+	LSHSimilarityThreshold: 0.88,
 	LSHBands: 32,
 	LSHRows: 4,
 	LSHHashes: 128,
--- a/internal/analyzer/clone_detector.go
+++ b/internal/analyzer/clone_detector.go
@@ -213,7 +213,7 @@ func DefaultCloneDetectorConfig() *CloneDetectorConfig {
 
 	// LSH defaults (opt-in)
 	UseLSH: false,
-	LSHSimilarityThreshold: 0.78,
+	LSHSimilarityThreshold: 0.88,
 	LSHBands: 32,
 	LSHRows: 4,
 	LSHMinHashCount: 128,
--- a/internal/config/clone_config.go
+++ b/internal/config/clone_config.go
@@ -197,7 +197,7 @@ func DefaultCloneConfig() *CloneConfig {
 	LSH: LSHConfig{
 		Enabled: "auto", // Auto-enable based on project size
 		AutoThreshold: 500, // Enable LSH for 500+ fragments
-		SimilarityThreshold: 0.78,
+		SimilarityThreshold: 0.88,
 		Bands: 32,
 		Rows: 4,
 		Hashes: 128,
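The defaults in this diff pair 32 bands with 4 rows per band, which together account for all 128 MinHash values. As a rough illustration of why an explicit similarity threshold still matters with those settings, the sketch below evaluates the textbook LSH banding formula, P(candidate) = 1 - (1 - s^rows)^bands, for a few Jaccard similarities s. It assumes the standard banding scheme; whether pyscn's implementation follows it exactly is not shown in this commit, and `candidateProbability` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

// candidateProbability returns the probability that two items with Jaccard
// similarity s share at least one identical band, under the textbook
// MinHash banding model: 1 - (1 - s^rows)^bands.
func candidateProbability(s float64, bands, rows int) float64 {
	return 1 - math.Pow(1-math.Pow(s, float64(rows)), float64(bands))
}

func main() {
	// Default values taken from the diff above.
	const bands, rows, hashes = 32, 4, 128

	// Sanity check: the bands should partition the whole signature.
	fmt.Println("bands*rows == hashes:", bands*rows == hashes)

	// Compare a moderately similar pair with the old and new thresholds.
	for _, s := range []float64{0.50, 0.78, 0.88} {
		fmt.Printf("s=%.2f  P(candidate)=%.4f\n",
			s, candidateProbability(s, bands, rows))
	}
}
```

Under this model, banding with 32 × 4 is permissive by design and admits many pairs well below either threshold as candidates, so the similarity threshold changed in this commit is the setting that decides which of those candidates are kept.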