DNA Identification in Python (CS50 Project 6)
CS50: Introduction to Computer Science, Project 6 Implementation
Project 6
The DNA problem focuses on identifying a person based on their genetic code. Instead of using real lab data, the program analyzes a DNA sequence file and compares it against a database of known individuals. Each person’s DNA profile contains counts of specific Short Tandem Repeats (STRs) — short sequences of bases repeated multiple times in a row.
Full problem explanation: edX
Project Requirements
- Read the database file into a variable
- Read the sample DNA sequence file into a variable
- Write a longest-match function that finds the longest chain of each STR
- Check the database for a profile that matches the sample DNA
Step 1: Check Command Line Arguments and Open Files in Memory
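The program expects sys.argv to contain exactly 3 entries:
- the program name (the script that starts the program)
- the database CSV file
- the sample DNA sequence file
csv.DictReader stores each row of the CSV database as one dictionary, so the reader variable yields, for every person, their name and the longest run count of each STR. The header row without its first column (name) gives the list of STRs, which is kept in subsequences.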
cla.py
import sys
# Check for command-line usage
if len(sys.argv) != 3:
    print("Invalid command-line arguments")
    # Exit with a non-zero status; a bare return is only valid inside a function
    sys.exit(1)
database.py
import csv
import sys
# Read database file into a variable
database = open(sys.argv[1], "r")
reader = csv.DictReader(database)
# Every column after "name" is an STR
subsequences = reader.fieldnames[1:]
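To make the shape of reader and subsequences concrete, here is a small sketch that feeds csv.DictReader an in-memory CSV instead of a file. The two-row table and the names in it are made up for illustration; only the column layout (name first, then one column per STR) follows the description above.

```python
import csv
import io

# Hypothetical miniature database; the real file is passed on the command line.
csv_text = "name,AGATC,AATG,TATC\nAlice,2,8,3\nBob,4,1,5\n"

reader = csv.DictReader(io.StringIO(csv_text))

# fieldnames is read from the header row; dropping "name" leaves the STRs.
subsequences = reader.fieldnames[1:]
print(subsequences)

# Each row is a dictionary, and every value in it is a string.
rows = list(reader)
print(rows[0])
```

Note that the counts come back as "2", "8", "3" and not as integers, which matters later when the rows are compared with the sample.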
Step 2: Read Sample DNA Sequence
The third command-line argument is a text file that holds the sample sequence on a single line, so one call to readline() reads the whole sequence.
sample_dna.py
# Read DNA sequence file into a variable
dna = open(sys.argv[2], "r")
sequence = dna.readline()
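One detail of readline() is worth seeing: the returned string keeps the line's trailing newline if the file has one. The sketch below uses io.StringIO as a stand-in for the opened sequence file, and the sequence text is invented. Calling strip() is an optional cleanup shown for comparison, not something the code above does.

```python
import io

# Stand-in for open(sys.argv[2], "r"); the bases here are made up.
dna = io.StringIO("AGATCAGATCTTTT\n")

raw = dna.readline()      # keeps the trailing "\n"
sequence = raw.strip()    # optional: drop surrounding whitespace

print(repr(raw))
print(len(raw), len(sequence))
```

Since an STR consists only of bases, the extra newline can never be part of a match; it only makes the string one character longer.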
Step 3: Longest Match Count Function
subsequences holds every STR, so we compute the longest chain of each STR in turn and store it in sample_dict. The counts are stored as strings because csv.DictReader returns every database value as a string.
longest_match.py
sample_dict = {}  # Longest run of each STR in the sample DNA file
# Find longest match of each STR in DNA sequence
for i in subsequences:
    sample_dict[i] = str(longest_match(sequence, i))
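The str() call matters because the comparison in the final step is plain dictionary equality, and csv.DictReader returns every field as a string. A minimal sketch with invented counts:

```python
# Counts as they would come from a database row (always strings).
database_dict = {"AGATC": "4", "AATG": "1"}

# The same counts computed from a sequence, stored two ways.
as_ints = {"AGATC": 4, "AATG": 1}
as_strs = {key: str(value) for key, value in as_ints.items()}

print(as_ints == database_dict)  # False: 4 is not equal to "4"
print(as_strs == database_dict)  # True
```

Converting the database values to int instead would work equally well; the only requirement is that both sides use the same type.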
longest_run stores the longest run found so far for the current STR, and count stores the length of the chain currently being followed.
The loop stops once fewer characters remain in sequence than the STR needs (e.g. TCTG needs 4 letters).
When the STR is found at position i, count is increased and compared with longest_run. If the chain breaks, count is reset to 0 so that the next chain is counted from scratch.
Consecutive copies of an STR do not overlap, so after a match i jumps directly past the end of the matched copy.
longest_match.py
def longest_match(sequence, subsequence):
    longest_run = 0
    count = 0
    i = 0
    # <= so that a copy ending at the very last character is still checked
    while i <= (len(sequence) - len(subsequence)):
        if sequence[i:(i + len(subsequence))] == subsequence:
            count += 1
            i += len(subsequence)
            if count > longest_run:
                longest_run = count
        else:
            count = 0
            i += 1
    return longest_run
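A quick way to check the function is to call it on short hand-made sequences. The sketch below repeats the function so that it runs on its own, with one assumption worth flagging: the loop bound is written as <= so that a run ending exactly at the last character is still counted. The test strings are invented.

```python
def longest_match(sequence, subsequence):
    """Return the longest run of back-to-back copies of subsequence."""
    longest_run = 0
    count = 0
    i = 0
    size = len(subsequence)
    # <= lets the final window (ending at the last character) be checked.
    while i <= len(sequence) - size:
        if sequence[i:i + size] == subsequence:
            count += 1
            i += size                      # jump past the matched copy
            longest_run = max(longest_run, count)
        else:
            count = 0                      # run broken, start counting again
            i += 1
    return longest_run


print(longest_match("AGATCAGATCAGATCTT", "AGATC"))    # 3
print(longest_match("TTAGATCGGAGATCAGATC", "AGATC"))  # 2
print(longest_match("TTTT", "AGATC"))                 # 0
```

The second call is the interesting one: the run of two sits at the very end of the string, so a strict < bound would stop one window early and report 1.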
Step 4: Find Matching Pattern Between Database and Sample DNA
sample_dict holds only STR counts and no name, so it cannot be compared with a database row as-is. Each row that reader yields is a dictionary with a person's name and their STR counts, so we copy the row into database_dict, remove the name key, and compare the two dictionaries directly.
If sample_dict equals the STR counts of any row, print the name of the person that row belongs to.
If no row matches, print "No match".
found.py
# Check database for matching profiles
for row in reader:
    database_dict = row.copy()
    database_dict.pop("name")
    if sample_dict == database_dict:
        print(row["name"])
        break
else:
    # The loop finished without break, so no profile matched
    print("No match")
database.close()
dna.close()
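The comparison loop can be tried without any files. In this sketch the database rows and the sample counts are invented, and find_match is a hypothetical helper name introduced only to wrap the loop so that it can be called; it is not part of the original program.

```python
def find_match(rows, sample_dict):
    """Return the name of the first row whose STR counts equal sample_dict."""
    for row in rows:
        database_dict = row.copy()   # leave the original row untouched
        database_dict.pop("name")    # keep only the STR columns
        if sample_dict == database_dict:
            return row["name"]
    return "No match"


# Made-up rows in the shape csv.DictReader produces (all values are strings).
rows = [
    {"name": "Alice", "AGATC": "2", "AATG": "8"},
    {"name": "Bob", "AGATC": "4", "AATG": "1"},
]

print(find_match(rows, {"AGATC": "4", "AATG": "1"}))  # Bob
print(find_match(rows, {"AGATC": "9", "AATG": "9"}))  # No match
```

Copying the row before calling pop() keeps row["name"] available for printing once a match is found.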
Full Code Implementation
dna.py
import csv
import sys


def main():
    sample_dict = {}
    # Check for command-line usage
    if len(sys.argv) != 3:
        print("Invalid command-line arguments")
        sys.exit(1)
    # Read database file into a variable
    database = open(sys.argv[1], "r")
    reader = csv.DictReader(database)
    subsequences = reader.fieldnames[1:]
    # Read DNA sequence file into a variable
    dna = open(sys.argv[2], "r")
    sequence = dna.readline()
    # Find longest match of each STR in DNA sequence
    for i in subsequences:
        sample_dict[i] = str(longest_match(sequence, i))
    # Check database for matching profiles
    for row in reader:
        database_dict = row.copy()
        database_dict.pop("name")
        if sample_dict == database_dict:
            print(row["name"])
            break
    else:
        # The loop finished without break, so no profile matched
        print("No match")
    database.close()
    dna.close()


def longest_match(sequence, subsequence):
    longest_run = 0
    count = 0
    i = 0
    # <= so that a copy ending at the very last character is still checked
    while i <= (len(sequence) - len(subsequence)):
        if sequence[i:(i + len(subsequence))] == subsequence:
            count += 1
            i += len(subsequence)
            if count > longest_run:
                longest_run = count
        else:
            count = 0
            i += 1
    return longest_run


main()
Example Execution
python dna.py databases/large.csv sequences/5.txt
Output:
Lavender
python dna.py databases/large.csv sequences/16.txt
Output:
No match


