끄적끄적
Codility Lesson5 - GenomicRangeQuery 본문
Codility Lesson5 - GenomicRangeQuery
https://app.codility.com/demo/results/trainingZFPZEZ-K9B/
Test results - Codility
A DNA sequence can be represented as a string consisting of the letters A, C, G and T, which correspond to the types of successive nucleotides in the sequence. Each nucleotide has an impact factor, which is an integer. Nucleotides of types A, C, G and T ha
app.codility.com
https://app.codility.com/demo/results/training7AVTF4-379/
Test results - Codility
A DNA sequence can be represented as a string consisting of the letters A, C, G and T, which correspond to the types of successive nucleotides in the sequence. Each nucleotide has an impact factor, which is an integer. Nucleotides of types A, C, G and T ha
app.codility.com
문제
A DNA sequence can be represented as a string consisting of the letters A, C, G and T, which correspond to the types of successive nucleotides in the sequence. Each nucleotide has an impact factor, which is an integer. Nucleotides of types A, C, G and T have impact factors of 1, 2, 3 and 4, respectively. You are going to answer several queries of the form: What is the minimal impact factor of nucleotides contained in a particular part of the given DNA sequence?
The DNA sequence is given as a non-empty string S = S[0]S[1]...S[N-1] consisting of N characters. There are M queries, which are given in non-empty arrays P and Q, each consisting of M integers. The K-th query (0 ≤ K < M) requires you to find the minimal impact factor of nucleotides contained in the DNA sequence between positions P[K] and Q[K] (inclusive).
For example, consider string S = CAGCCTA and arrays P, Q such that:
P[0] = 2 Q[0] = 4 P[1] = 5 Q[1] = 5 P[2] = 0 Q[2] = 6
The answers to these M = 3 queries are as follows:
- The part of the DNA between positions 2 and 4 contains nucleotides G and C (twice), whose impact factors are 3 and 2 respectively, so the answer is 2.
- The part between positions 5 and 5 contains a single nucleotide T, whose impact factor is 4, so the answer is 4.
- The part between positions 0 and 6 (the whole string) contains all nucleotides, in particular nucleotide A whose impact factor is 1, so the answer is 1.
Write a function:
def solution(S, P, Q)
that, given a non-empty string S consisting of N characters and two non-empty arrays P and Q consisting of M integers, returns an array consisting of M integers specifying the consecutive answers to all queries.
Result array should be returned as an array of integers.
For example, given the string S = CAGCCTA and arrays P, Q such that:
P[0] = 2 Q[0] = 4 P[1] = 5 Q[1] = 5 P[2] = 0 Q[2] = 6
the function should return the values [2, 4, 1], as explained above.
Write an efficient algorithm for the following assumptions:
- N is an integer within the range [1..100,000];
- M is an integer within the range [1..50,000];
- each element of arrays P and Q is an integer within the range [0..N - 1];
- P[K] ≤ Q[K], where 0 ≤ K < M;
- string S consists only of upper-case English letters A, C, G, T.

정답
def solution(S, P, Q):
S_fac=[0]*len(S)
result=[]
for i in range(len(S)):
if S[i]=='A':
S_fac[i]=1
elif S[i]=='C':
S_fac[i]=2
elif S[i]=='G':
S_fac[i]=3
else:
S_fac[i]=4
for i in range(len(P)):
a = S_fac[P[i]:Q[i]+1]
a.sort()
result.append(a[0])
return result
결과


-> 시간 복잡도를 낮출 필요가 있다.
정답 - Retry
def solution(S, P, Q):
result=[]
for i in range(len(P)):
a = S[P[i]:Q[i]+1]
if 'A' in a:
result.append(1)
elif 'C' in a:
result.append(2)
elif 'G' in a:
result.append(3)
else:
result.append(4)
return result
결과 - Retry


-> 위의 코드는 아래 코드에 비해 DNA를 숫자로 치환하는 과정이 없어서 시간이 단축되는 것은 이해하지만, 이것은 주어진 문제에 극한해서만 적합할 수 있다고 생각. 따라서 블로그, GPT 등을 사용해 좀 더 최적의 방법 탐색
정답 - BEST
Prefix Sum 기법 사용
Prefix Sum 기법
- 배열의 각 요소에 대해 그 요소를 포함하여 이전의 모든 요소들의 합을 저장한 새로운 배열을 의미
예를 들어 a = [1, 2, 3, 4, 5]의 Presix Sum a = [1, 3, 6, 10, 15]
이를 구성하는 방법
- 배열의 첫 번째 요소부터 시작해, 각 요소의 값을 이전의 Presix Sum 값에 더한다.
Presix Sum[i] = Presix Sum[i-1] + a[i] , (단 i>0)
def solution(S, P, Q):
N = len(S)
sumA, sumC, sumG = [0] * (N + 1), [0] * (N + 1), [0] * (N + 1)
result = []
for idx, s in enumerate(S):
sumA[idx + 1] = sumA[idx] + (s == 'A')
sumC[idx + 1] = sumC[idx] + (s == 'C')
sumG[idx + 1] = sumG[idx] + (s == 'G')
for p, q in zip(P, Q):
q += 1 # 누적합 배열은 0에서 시작하므로 끝 위치를 조정 S[p]와 S[q] 사이를 보는 것이므로 S[q]도 포함하기 위해 +1
'''sumA는 해당 인덱스일 때 A가 나온 횟수를 의미.
즉 S = 'CAGCCTA' 일 때,
sumA = [0 0 1 1 1 1 1 2]
sumA[p]==sumA[q]라면 p위치에서 q위치까지 A의 등장이 없음'''
if sumA[p] != sumA[q]:
result.append(1)
elif sumC[p] != sumC[q]:
result.append(2)
elif sumG[p] != sumG[q]:
result.append(3)
else:
result.append(4)
return result
-> sumA,B,C의 길이가 하나씩 추가되었는데 왜 p는 그대로 q는 그대로인지 이해하지 못함.
아래와 같이 이해
S[P:Q]를 살펴봐야하는데 S[P]부터 S[Q]까지 살펴봐야함 S[P]와 S[Q] 사이에 나온 A,C,G,T,를 세기 위해서는 S[Q]까지 나온 A,C,G,T 개수에서 S[P]까지 나온 A,C,G,T 개수를 빼는 것이 제일 효율적 sumA/C/G/T[i]는 i-1까지의 sumA/C/G/T 값이므로 sumA/C/G/T[Q+1] - sumA/C/G/T[P]
'Codility Lessons' 카테고리의 다른 글
| Codility Lesson5 - CountDiv (0) | 2023.09.20 |
|---|---|
| Codility Lesson5 - PassingCars (0) | 2023.09.20 |
| Codility Lesson4 - MissingInteger (0) | 2023.09.19 |
| Codility Lesson4 - MaxCounters (0) | 2023.09.18 |
| Codility Lesson4 - PermCheck (0) | 2023.09.18 |