DNA Identification in Python (CS50 Project 6)


--CS50: Introduction to Computer Science Project 6 Implementation


Project 6:-

The DNA problem focuses on identifying a person based on their genetic code. Instead of using real lab data, the program analyzes a DNA sequence file and compares it against a database of known individuals. Each person’s DNA profile contains counts of specific Short Tandem Repeats (STRs) — short sequences of bases repeated multiple times in a row.

Problem Full Explanation: edx


Project Requirements

  • Read the database file into a variable
  • Read the sample DNA sequence file into a variable
  • Write a longest-match function that finds the longest run of each STR
  • Check the database for a profile matching the sample DNA

Step 1: Check Command Line Arguments and Open Files in Memory

The program expects sys.argv to contain three entries:
  • the program name (dna.py)
  • the database CSV file
  • the sample DNA sequence file
We use csv.DictReader to read the rows of the CSV file. Its fieldnames give the STR patterns. The longest run of each pattern in the sample is later written to the sample_dict dictionary so it can be compared against the database.

The reader variable yields one dictionary per row of the CSV database. Each dictionary holds a person's name and their longest run for each STR.
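To make the shape of the data concrete, here is a minimal sketch of what csv.DictReader produces. The two-person database below is made up for illustration and is inlined with io.StringIO so the example runs without a file on disk.

```python
import csv
import io

# Hypothetical two-person database, inlined so the example needs no file.
csv_text = "name,AGAT,AATG,TATC\nAlice,2,8,3\nBob,4,1,5\n"

reader = csv.DictReader(io.StringIO(csv_text))

# fieldnames holds the header row; everything after "name" is an STR.
subsequences = reader.fieldnames[1:]
print(subsequences)  # ['AGAT', 'AATG', 'TATC']

# Each row is a dictionary, and every value in it is a string.
rows = list(reader)
print(rows[0]["name"])        # Alice
print(repr(rows[0]["AGAT"]))  # '2'
```

Note that the counts come back as strings such as '2', not integers, which matters later when the sample is compared with the database.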

cla.py

import sys
  
    # Check for command-line usage
    if len(sys.argv) != 3:
        print("Invalid command-line arguments")
        return
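
As an alternative to printing and returning, the check can be wrapped in a small helper that exits with a usage message. This is only a sketch: parse_args is a hypothetical name that does not appear in the project code, and the usage text is an assumption.

```python
import sys


def parse_args(argv):
    """Return (database_path, sequence_path), or exit with a usage message."""
    if len(argv) != 3:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit("Usage: python dna.py DATABASE.csv SEQUENCE.txt")
    return argv[1], argv[2]


# Simulated argument list; a real script would pass sys.argv instead.
database_path, sequence_path = parse_args(
    ["dna.py", "databases/large.csv", "sequences/5.txt"]
)
print(database_path, sequence_path)  # databases/large.csv sequences/5.txt
```

Exiting with a non-zero status lets a shell script or test harness detect that the program was called incorrectly.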
  
database.py

import csv
  
    # Read database file into a variable
    database = open(sys.argv[1],"r")
    reader = csv.DictReader(database)
    subsequences = reader.fieldnames[1:]
  

Step 2: Read Sample DNA Sequence

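The third entry in sys.argv is the sample sequence file. The sequence is stored as a single line of text, so readline() reads the whole sequence. Note that readline() keeps the trailing newline character if the file has one.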



sample_dna.py

    # Read DNA sequence file into a variable
    dna = open(sys.argv[2], "r")
    sequence = dna.readline()
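
The following sketch shows how the trailing newline affects the length of what readline() returns. It uses io.StringIO as a stand-in for an opened sequence file, and the short sequence is made up for illustration.

```python
import io

# Stand-in for an opened sequence file whose single line ends with a newline.
dna = io.StringIO("AGATAGATAATG\n")

raw = dna.readline()
sequence = raw.strip()

print(len(raw))       # 13, the newline is counted
print(len(sequence))  # 12, bases only
```

Whether the newline is present changes len(sequence), so it affects any loop bound computed from that length. If you strip it, make sure the search loop still examines the last possible start position.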
  

Step 3: Longest Match Count Function

subsequences holds every STR, so we compute the longest run of each STR in turn and store it in sample_dict. The counts are stored as strings because csv.DictReader returns every database value as a string.

longest_match.py

sample_dict = {} # Store each STR longest match of sample dna file
  
    # Find longest match of each STR in DNA sequence

    for i in subsequences:
        sample_dict[i] = str(longest_match(sequence,i))
  
longest_run stores the longest run found so far for the current STR, and count stores the length of the run currently being scanned.

The loop stops advancing i once too few letters remain in sequence to form the current STR (for example, TCTG needs 4 letters).

When the pattern is found at position i, count is incremented and compared with longest_run. If the chain breaks, count resets to 0 so the next run is counted from scratch.

This code assumes that runs of an STR do not overlap with each other, so after a match i jumps directly past the end of the matched pattern.

longest_match.py

def longest_match(sequence, subsequence):
    longest_run = 0
    count = 0
    
    i = 0
    # <= so that the last possible start position is also checked
    while i <= (len(sequence) - len(subsequence)):
        if sequence[i:(i+len(subsequence))] == subsequence:
            count += 1
            i += len(subsequence)

            if count > longest_run:
                longest_run = count

        else:
            count = 0
            i += 1

    return longest_run
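
For comparison, here is a sketch of a different strategy: instead of jumping past each match, it tries every start index and counts how many back-to-back copies follow. It is not the function used in this project, and longest_run_from_each_start is a hypothetical name. It does more work, but it does not rely on the no-overlap assumption.

```python
def longest_run_from_each_start(sequence, subsequence):
    """Longest run of back-to-back copies of subsequence, trying every start index."""
    step = len(subsequence)
    if step == 0:
        return 0  # an empty pattern would otherwise match forever

    longest = 0
    # range(...) includes the last index where a full copy still fits.
    for start in range(len(sequence) - step + 1):
        count = 0
        pos = start
        # Count consecutive copies beginning at this start index.
        while sequence[pos:pos + step] == subsequence:
            count += 1
            pos += step
        longest = max(longest, count)
    return longest


print(longest_run_from_each_start("TTAGATAGATCC", "AGAT"))  # 2
print(longest_run_from_each_start("AGATAGATAGAT", "AGAT"))  # 3, run ends at the last base
print(longest_run_from_each_start("ATATAATA", "ATA"))       # 2, run starts inside an earlier match
```

The third call shows a case where a run begins inside a previously matched copy. A scan that always jumps past each match would report 1 there.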
  

Step 4: Find Matching Pattern Between Database and Sample DNA

sample_dict holds only STR counts, not the name of the person the sample belongs to. For each database row we therefore make a copy, database_dict, with the "name" key removed, so the two dictionaries can be compared directly.

Each row that reader yields is a dictionary holding one person's name and their longest run for each STR.

If sample_dict equals the database_dict of any row, we print the name of that person.

If no row matches, we print "No match".

found.py

    # Check database for matching profiles
    for row in reader:
        database_dict = row.copy()
        database_dict.pop("name")

        if sample_dict == database_dict:
            print(row["name"])

            database.close()
            dna.close()
            return

    print("No match")
    database.close()
    dna.close()
    return
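
The comparison only works when both sides use the same value type. As a sketch of an alternative, the helper below keeps the sample counts as integers and converts the database values instead. find_match is a hypothetical name, and the rows are made-up data shaped like csv.DictReader output.

```python
def find_match(rows, sample_counts):
    """Return the name of the first row whose STR counts equal sample_counts, else None."""
    for row in rows:
        # csv values are strings, so convert before comparing with integer counts.
        if all(int(row[str_name]) == count for str_name, count in sample_counts.items()):
            return row["name"]
    return None


# Hypothetical rows shaped like csv.DictReader output.
rows = [
    {"name": "Alice", "AGAT": "2", "AATG": "8", "TATC": "3"},
    {"name": "Bob", "AGAT": "4", "AATG": "1", "TATC": "5"},
]

print(find_match(rows, {"AGAT": 4, "AATG": 1, "TATC": 5}))  # Bob
print(find_match(rows, {"AGAT": 9, "AATG": 9, "TATC": 9}))  # None
print({"AGAT": 4} == {"AGAT": "4"})  # False: the integer 4 is not the string "4"
```

Either approach is fine as long as it is applied consistently: convert the sample counts to strings, or convert the database values to integers.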
  

Full Code Implementation

dna.py

import csv
import sys


def main():
    sample_dict = {}

    # Check for command-line usage
    if len(sys.argv) != 3:
        print("Invalid command-line arguments")
        return

    # Read database file into a variable
    database = open(sys.argv[1],"r")
    reader = csv.DictReader(database)
    subsequences = reader.fieldnames[1:]

    # Read DNA sequence file into a variable
    dna = open(sys.argv[2], "r")
    sequence = dna.readline()


    # Find longest match of each STR in DNA sequence

    for i in subsequences:
        sample_dict[i] = str(longest_match(sequence,i))

    # Check database for matching profiles
    for row in reader:
        database_dict = row.copy()
        database_dict.pop("name")

        if sample_dict == database_dict:
            print(row["name"])

            database.close()
            dna.close()
            return

    print("No match")
    database.close()
    dna.close()
    return


def longest_match(sequence, subsequence):
    longest_run = 0
    count = 0
    
    i = 0
    # <= so that the last possible start position is also checked
    while i <= (len(sequence) - len(subsequence)):
        if sequence[i:(i+len(subsequence))] == subsequence:
            count += 1
            i += len(subsequence)

            if count > longest_run:
                longest_run = count

        else:
            count = 0
            i += 1

    return longest_run

main()

  

Example Execution

python dna.py databases/large.csv sequences/5.txt

Output:-

Lavender


python dna.py databases/large.csv sequences/16.txt

Output:-

No match