끄적끄적

Codility Lesson5 - GenomicRangeQuery 본문

Codility Lessons

Codility Lesson5 - GenomicRangeQuery

kminx 2023. 9. 20. 18:00

Codility Lesson5 - GenomicRangeQuery

https://app.codility.com/demo/results/trainingZFPZEZ-K9B/

 

Test results - Codility

A DNA sequence can be represented as a string consisting of the letters A, C, G and T, which correspond to the types of successive nucleotides in the sequence. Each nucleotide has an impact factor, which is an integer. Nucleotides of types A, C, G and T ha

app.codility.com

https://app.codility.com/demo/results/training7AVTF4-379/

 

Test results - Codility

A DNA sequence can be represented as a string consisting of the letters A, C, G and T, which correspond to the types of successive nucleotides in the sequence. Each nucleotide has an impact factor, which is an integer. Nucleotides of types A, C, G and T ha

app.codility.com

 

문제

A DNA sequence can be represented as a string consisting of the letters A, C, G and T, which correspond to the types of successive nucleotides in the sequence. Each nucleotide has an impact factor, which is an integer. Nucleotides of types A, C, G and T have impact factors of 1, 2, 3 and 4, respectively. You are going to answer several queries of the form: What is the minimal impact factor of nucleotides contained in a particular part of the given DNA sequence?

The DNA sequence is given as a non-empty string S = S[0]S[1]...S[N-1] consisting of N characters. There are M queries, which are given in non-empty arrays P and Q, each consisting of M integers. The K-th query (0 ≤ K < M) requires you to find the minimal impact factor of nucleotides contained in the DNA sequence between positions P[K] and Q[K] (inclusive).

For example, consider string S = CAGCCTA and arrays P, Q such that:

P[0] = 2 Q[0] = 4 P[1] = 5 Q[1] = 5 P[2] = 0 Q[2] = 6

The answers to these M = 3 queries are as follows:

  • The part of the DNA between positions 2 and 4 contains nucleotides G and C (twice), whose impact factors are 3 and 2 respectively, so the answer is 2.
  • The part between positions 5 and 5 contains a single nucleotide T, whose impact factor is 4, so the answer is 4.
  • The part between positions 0 and 6 (the whole string) contains all nucleotides, in particular nucleotide A whose impact factor is 1, so the answer is 1.

Write a function:

def solution(S, P, Q)

that, given a non-empty string S consisting of N characters and two non-empty arrays P and Q consisting of M integers, returns an array consisting of M integers specifying the consecutive answers to all queries.

Result array should be returned as an array of integers.

For example, given the string S = CAGCCTA and arrays P, Q such that:

P[0] = 2 Q[0] = 4 P[1] = 5 Q[1] = 5 P[2] = 0 Q[2] = 6

the function should return the values [2, 4, 1], as explained above.

Write an efficient algorithm for the following assumptions:

  • N is an integer within the range [1..100,000];
  • M is an integer within the range [1..50,000];
  • each element of arrays P and Q is an integer within the range [0..N - 1];
  • P[K] ≤ Q[K], where 0 ≤ K < M;
  • string S consists only of upper-case English letters A, C, G, T.

 

정답

def solution(S, P, Q):
    
    S_fac=[0]*len(S)
    result=[]

    for i in range(len(S)):
        if S[i]=='A':
            S_fac[i]=1
        elif S[i]=='C':
            S_fac[i]=2
        elif S[i]=='G':
            S_fac[i]=3
        else:
            S_fac[i]=4


    for i in range(len(P)):
        a = S_fac[P[i]:Q[i]+1]
        a.sort()
        result.append(a[0])

    return result

 

결과

 

-> 시간 복잡도를 낮출 필요가 있다. 


정답 - Retry

def solution(S, P, Q):

    result=[]


    for i in range(len(P)):
        a = S[P[i]:Q[i]+1]
        if 'A' in a:
            result.append(1)
        elif 'C' in a:
            result.append(2)
        elif 'G' in a:
            result.append(3)
        else:
            result.append(4)
        

    return result

 

결과 - Retry

-> 위의 코드는 아래 코드에 비해 DNA를 숫자로 치환하는 과정이 없어서 시간이 단축되는 것은 이해하지만, 이것은 주어진 문제에 극한해서만 적합할 수 있다고 생각. 따라서 블로그, GPT 등을 사용해 좀 더 최적의 방법 탐색

 

정답 - BEST

Prefix Sum 기법 사용

Prefix Sum 기법 

- 배열의 각 요소에 대해 그 요소를 포함하여 이전의 모든 요소들의 합을 저장한 새로운 배열을 의미

  예를 들어 a = [1, 2, 3, 4, 5]의  Presix Sum a = [1, 3, 6, 10, 15]

  이를 구성하는 방법

  • 배열의 첫 번째 요소부터 시작해, 각 요소의 값을 이전의 Presix Sum 값에 더한다.
    Presix Sum[i] = Presix Sum[i-1] + a[i] , (단 i>0)
def solution(S, P, Q):
    N = len(S)
    sumA, sumC, sumG = [0] * (N + 1), [0] * (N + 1), [0] * (N + 1)
    result = []

    for idx, s in enumerate(S):
        sumA[idx + 1] = sumA[idx] + (s == 'A')
        sumC[idx + 1] = sumC[idx] + (s == 'C')
        sumG[idx + 1] = sumG[idx] + (s == 'G')

    for p, q in zip(P, Q):
        q += 1 # 누적합 배열은 0에서 시작하므로 끝 위치를 조정 S[p]와 S[q] 사이를 보는 것이므로 S[q]도 포함하기 위해 +1
        '''sumA는 해당 인덱스일 때 A가 나온 횟수를 의미.
        즉 S = 'CAGCCTA' 일 때,
            sumA = [0 0 1 1 1 1 1 2]
        sumA[p]==sumA[q]라면 p위치에서 q위치까지 A의 등장이 없음'''
        if sumA[p] != sumA[q]:
            result.append(1)
        elif sumC[p] != sumC[q]:
            result.append(2)
        elif sumG[p] != sumG[q]:
            result.append(3)
        else:
            result.append(4)

    return result

-> sumA,B,C의 길이가 하나씩 추가되었는데 왜 p는 그대로 q는 그대로인지 이해하지 못함.

아래와 같이 이해

S[P:Q]를 살펴봐야하는데 S[P]부터 S[Q]까지 살펴봐야함

S[P]와 S[Q] 사이에 나온 A,C,G,T,를 세기 위해서는 S[Q]까지 나온 A,C,G,T 개수에서 S[P]까지 나온 A,C,G,T 개수를 빼는 것이 제일 효율적

sumA/C/G/T[i]는 i-1까지의 sumA/C/G/T 값이므로 sumA/C/G/T[Q+1] - sumA/C/G/T[P]

 

'Codility Lessons' 카테고리의 다른 글

Codility Lesson5 - CountDiv  (0) 2023.09.20
Codility Lesson5 - PassingCars  (0) 2023.09.20
Codility Lesson4 - MissingInteger  (0) 2023.09.19
Codility Lesson4 - MaxCounters  (0) 2023.09.18
Codility Lesson4 - PermCheck  (0) 2023.09.18