DNA Identification in Python (CS50 Project 6)

Captain Cipher

15 Oct, 2025

--CS50: Introduction to Computer Science Project 6 Implementation

Project 6:-

The DNA problem focuses on identifying a person based on their genetic code. Instead of using real lab data, the program analyzes a DNA sequence file and compares it against a database of known individuals. Each person’s DNA profile contains counts of specific Short Tandem Repeats (STRs) — short sequences of bases repeated multiple times in a row.

Problem Full Explanation: edx

Project Requirements

Read the database file into a variable
Read the sample DNA sequence file into a variable
Write a longest-match function that finds the longest run of each STR
Check the database for a profile that matches the sample DNA

Step 1: Check Command Line Arguments and Open Files in Memory

This program takes 3 command-line arguments:

The program name (dna.py itself, stored in sys.argv[0])
The database CSV file
The sample DNA sequence file

We use csv.DictReader to read the rows of the CSV file. Its fieldnames list holds the column headers; every header after name is an STR pattern, and these become the keys of sample_dict, which is later compared with each database row.

The reader variable yields one dictionary per row of the CSV database: the person's name plus the longest run of each STR for that person.

cla.py


import sys


def main():
    # Check for command-line usage: script name plus two file paths
    if len(sys.argv) != 3:
        print("Invalid command-line arguments")
        return
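As a rough illustration of how the three arguments line up with sys.argv, here is a minimal sketch. The helper name get_paths is hypothetical and not part of the project code; it only shows which index holds which argument.

```python
def get_paths(argv):
    """Hypothetical helper: return (database_path, sequence_path) or None."""
    # argv[0] is the script name, so a valid call has exactly 3 entries
    if len(argv) != 3:
        return None
    return argv[1], argv[2]


# The lists below stand in for sys.argv
print(get_paths(["dna.py", "databases/large.csv", "sequences/5.txt"]))
# ('databases/large.csv', 'sequences/5.txt')
print(get_paths(["dna.py"]))
# None
```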

database.py


import csv
  
    # Read database file into a variable
    database = open(sys.argv[1], "r")
    reader = csv.DictReader(database)
    # The first column is "name"; every other column header is an STR
    subsequences = reader.fieldnames[1:]
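To see what DictReader hands back, the sketch below reads a tiny in-memory CSV instead of a real database file. The names and counts are made up for illustration; only csv.DictReader and fieldnames come from the code above.

```python
import csv
import io

# Hypothetical three-column database, made up for this example
csv_text = "name,AGAT,AATG\nAlice,3,2\nBob,5,1\n"

reader = csv.DictReader(io.StringIO(csv_text))
subsequences = reader.fieldnames[1:]   # skip the "name" column
rows = list(reader)                    # one dictionary per data row

print(subsequences)            # ['AGAT', 'AATG']
print(rows[0]["name"])         # Alice
print(repr(rows[0]["AGAT"]))   # '3'  (every field is a string)
```

Note that the counts come back as strings, not integers, which matters when they are compared with counts computed from the sample.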

Step 2: Read Sample DNA Sequence

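The third command-line argument is a text file that holds the sample sequence on a single line, so readline() reads the whole sequence. The returned string keeps any trailing newline; a newline is never part of an STR, so it does not affect the counts.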

sample_dna.py


    # Read DNA sequence file into a variable
    dna = open(sys.argv[2], "r")
    sequence = dna.readline()
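A small sketch of what readline() returns, using io.StringIO as a stand-in for the sequence file. The sequence text is made up for illustration.

```python
import io

# Stand-in for an opened sequence file; the contents are made up
dna = io.StringIO("AGATAGATAATG\n")
sequence = dna.readline()

print(repr(sequence))   # 'AGATAGATAATG\n'
print(len(sequence))    # 13  (12 bases plus the newline)
```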

Step 3: Longest Match Count Function

subsequences holds every STR, so we find the longest run of each STR in turn and store it in sample_dict. Each count is stored as a string because csv.DictReader returns every database field as a string.

longest_match.py


sample_dict = {} # Store each STR longest match of sample dna file
  
    # Find longest match of each STR in DNA sequence

    for i in subsequences:
        sample_dict[i] = str(longest_match(sequence,i))

longest_run stores the longest run found so far for the current STR, and count stores the length of the run currently being followed as the loop advances.

The loop stops once fewer letters remain in sequence than the current STR needs (e.g. TCTG needs 4 letters).

When the pattern is found, count is increased and compared with longest_run. If the run breaks, count is reset to 0 so that the next run of the STR can be counted.

The code assumes that occurrences of an STR do not overlap, so after a match i jumps directly past the end of the matched pattern.

longest_match.py


def longest_match(sequence, subsequence):
    longest_run = 0
    count = 0
    
    i = 0
    # <= so the last possible window (ending at the final character) is checked
    while i <= (len(sequence) - len(subsequence)):
        if sequence[i:(i+len(subsequence))] == subsequence:
            count += 1
            i += len(subsequence)

            if count > longest_run:
                longest_run = count

        else:
            count = 0
            i += 1

    return longest_run
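To check the function on small inputs, here is a self-contained copy written with a <= loop bound, so that a repeat ending at the very last character is counted. The test sequences are made up for illustration and are not from the CS50 data.

```python
def longest_match(sequence, subsequence):
    """Return the longest run of back-to-back copies of subsequence."""
    longest_run = 0
    count = 0
    step = len(subsequence)

    i = 0
    # <= lets the final window, ending at the last character, be checked
    while i <= len(sequence) - step:
        if sequence[i:i + step] == subsequence:
            count += 1
            longest_run = max(longest_run, count)
            i += step      # jump past the repeat just matched
        else:
            count = 0      # run broken, start counting again
            i += 1

    return longest_run


# Made-up sequence: three AGATs in a row, then a single AGAT later on
sample = "GGAGATAGATAGATCCAGATTT"
print(longest_match(sample, "AGAT"))         # 3
print(longest_match(sample, "TTTT"))         # 0
print(longest_match("CCAATGAATG", "AATG"))   # 2 (run ends at the last character)
```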

Step 4: Find Matching Pattern Between Database and Sample DNA

sample_dict holds only STR counts, not the name of the person the sample belongs to. The reader variable yields one dictionary per database row, so for each row we make a copy, database_dict, remove its name key, and compare the two dictionaries directly.

If sample_dict equals database_dict for any row, we print the name of the person in that row.

If no row matches, we print No match.

found.py


    # Check database for matching profiles
    for row in reader:
        database_dict = row.copy()
        database_dict.pop("name")

        if sample_dict == database_dict:
            print(row["name"])

            database.close()
            dna.close()
            return

    print("No match")
    database.close()
    dna.close()
    return
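The comparison only succeeds when both dictionaries hold values of the same type. The sketch below uses made-up values to show why the sample counts are converted with str() before comparing.

```python
# Hypothetical database row, shaped the way csv.DictReader would return it
row = {"name": "Alice", "AGAT": "3", "AATG": "2"}

database_dict = row.copy()
database_dict.pop("name")   # leave only the STR counts

counts_as_int = {"AGAT": 3, "AATG": 2}
counts_as_str = {key: str(value) for key, value in counts_as_int.items()}

print(counts_as_int == database_dict)   # False (3 is not equal to "3")
print(counts_as_str == database_dict)   # True
print(row["name"])                      # Alice (the copy left row untouched)
```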

Full Code Implementation

dna.py


import csv
import sys


def main():
    sample_dict = {}

    # Check for command-line usage
    if len(sys.argv) != 3:
        print("Invalid command-line arguments")
        return

    # Read database file into a variable
    database = open(sys.argv[1],"r")
    reader = csv.DictReader(database)
    subsequences = reader.fieldnames[1:]

    # Read DNA sequence file into a variable
    dna = open(sys.argv[2], "r")
    sequence = dna.readline()


    # Find longest match of each STR in DNA sequence

    for i in subsequences:
        sample_dict[i] = str(longest_match(sequence,i))

    # Check database for matching profiles
    for row in reader:
        database_dict = row.copy()
        database_dict.pop("name")

        if sample_dict == database_dict:
            print(row["name"])

            database.close()
            dna.close()
            return

    print("No match")
    database.close()
    dna.close()
    return


def longest_match(sequence, subsequence):
    longest_run = 0
    count = 0
    
    i = 0
    # <= so the last possible window (ending at the final character) is checked
    while i <= (len(sequence) - len(subsequence)):
        if sequence[i:(i+len(subsequence))] == subsequence:
            count += 1
            i += len(subsequence)

            if count > longest_run:
                longest_run = count

        else:
            count = 0
            i += 1

    return longest_run

main()
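As one possible variation, the same flow can be written with with blocks so that both files are closed automatically, even on an early return. This is only a sketch: the identify helper and the tiny temporary files are hypothetical, made up so the example runs without the CS50 data.

```python
import csv
import os
import tempfile


def longest_match(sequence, subsequence):
    longest_run = count = i = 0
    step = len(subsequence)
    while i <= len(sequence) - step:
        if sequence[i:i + step] == subsequence:
            count += 1
            longest_run = max(longest_run, count)
            i += step
        else:
            count = 0
            i += 1
    return longest_run


def identify(database_path, sequence_path):
    """Hypothetical helper: return the matching name, or "No match"."""
    with open(sequence_path) as dna:
        sequence = dna.readline()

    with open(database_path, newline="") as database:
        reader = csv.DictReader(database)
        subsequences = reader.fieldnames[1:]
        sample = {s: str(longest_match(sequence, s)) for s in subsequences}
        for row in reader:
            profile = {s: row[s] for s in subsequences}
            if profile == sample:
                return row["name"]
    return "No match"


# Tiny made-up files so the sketch can run on its own
with tempfile.TemporaryDirectory() as folder:
    db_path = os.path.join(folder, "tiny.csv")
    seq_path = os.path.join(folder, "tiny.txt")
    with open(db_path, "w", newline="") as f:
        f.write("name,AGAT,AATG\nAlice,3,2\nBob,1,4\n")
    with open(seq_path, "w") as f:
        f.write("AGATAGATAGATCCAATGAATG\n")
    result = identify(db_path, seq_path)

print(result)   # Alice
```

Returning the name instead of printing it inside the loop keeps the file handling in one place and makes the function easier to test.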

Example Execution

python dna.py databases/large.csv sequences/5.txt

Output:-

Lavender

python dna.py databases/large.csv sequences/16.txt

Output:-

No match
