File size: 14,180 Bytes
996be7a
 
 
89fd119
996be7a
 
 
 
 
 
503a52a
996be7a
 
 
 
 
 
 
 
 
 
 
503a52a
 
996be7a
 
 
 
 
 
 
 
8ce2218
996be7a
 
 
8ce2218
503a52a
 
996be7a
 
 
 
 
 
 
 
503a52a
 
 
 
 
 
 
996be7a
503a52a
 
 
 
 
 
 
 
 
 
 
 
 
 
996be7a
 
503a52a
996be7a
503a52a
996be7a
503a52a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
996be7a
503a52a
 
 
 
f72df0a
d339acc
 
 
 
 
 
 
 
 
503a52a
 
 
 
 
 
 
 
 
 
996be7a
 
 
 
 
d339acc
 
503a52a
d339acc
996be7a
 
d339acc
996be7a
 
503a52a
 
 
996be7a
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
import gradio as gr
import openai
import numpy as np
# from scipy.stats import norm

# Make sure to replace 'your_api_key' with your actual OpenAI API key
openai.api_key = "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"

harry = "Harry Potter and the Sorcerer's Stone\n\nas though Harry was being stupid on purpose. Getting desperate, Harry asked for the train that left at eleven o'clock, but the guard said there wasn't one. In the end the guard strode away, muttering about time wasters. Harry was now trying hard not to panic. According to the large clock over the arrivals board, he had ten minutes left to get on the train to Hogwarts and he had no idea how to do it; he was stranded in the middle of a station with a trunk he could hardly lift, a pocket full of wizard money, and a large owl. Hagrid must have forgotten to tell him something you had to do, like tapping the third brick on the left to get into Diagon Alley. He wondered if he should get out his wand and start tapping the ticket inspector's stand between platforms nine and ten. At that moment a group of people passed just behind him and he caught a few words of what they were saying. \" -- packed with Muggles, of course -- \" Harry swung round. The speaker was a plump woman who was talking to four boys, all with flaming red hair. Each of them was pushing a trunk like Harry's in front of him -- and they had an owl. Heart hammering, Harry pushed his cart after them. They stopped and so did he, just near enough to hear what they were saying. \"Now, what's the platform number?\" said the boys' mother. \"Nine and three-quarters!\" piped a small girl, also red-headed, who was holding her hand, \"Mom, can't I go...\" \"You're not old enough, Ginny, now be quiet. All right, Percy, you go first.\" What looked like the oldest boy marched toward platforms nine and ten. Harry watched, careful not to blink in case he missed it -- but just as the boy reached the dividing barrier between the two platforms, a large crowd of tourists came swarming in front of him and by the time the last backpack had cleared away, the boy had vanished. \"Fred, you next,\" the plump woman said. \"I'm not Fred, I'm George,\" said the boy. \"Honestly, woman, you call yourself our mother? Can't you tell I'm George?\" \"Sorry, George, dear.\" \"Only joking, I am Fred,\" said the boy, and off he went. His twin called after him to hurry up, and he must have done so, because a second later, he had gone -- but how had he done it? Now the third brother was walking briskly toward the barrier -- he was almost there -- and then, quite suddenly, he wasn't anywhere. There was nothing else for it. \"Excuse me,\" Harry said to the plump woman. \"Hello, dear,\" she said. \"First time at Hogwarts? Ron's new, too.\" She pointed at the last and youngest of her sons. He was tall, thin, and gangling, with freckles, big hands and feet, and a long nose. \"Yes,\" said Harry. \"The thing is -- the thing is, I don't know how to -- \" \"How to get onto the platform?\" she"
pineapple = "Pineapple Street\n\nat buildings that hadn\u2019t changed, at the thin ridge of White Mountain crest rising above the eastern tree line, it was easy to imagine the place had been cryogenically preserved. Fran had offered me her couch, but the way she said it\u2014\u201cI mean, there\u2019s the dog, and Jacob\u2019s always at volume eleven, and Max still doesn\u2019t sleep through the night\u201d\u2014made it seem more gesture than invitation. So I\u2019d opted to stay in one of the two guest apartments, located right above the ravine in a small house that used to be the business office. There were a bedroom and bathroom on each floor, plus a downstairs kitchen to share. The whole place, I found, smelled like bleach. I unpacked, worrying I hadn\u2019t brought enough sweaters, and thinking, of all things, about Granby pay phones. Imagine me (remember me), fifteen, sixteen, dressed in black even when I wasn\u2019t backstage, my taped-up Doc Martens, the dark, wispy hair fringing my Cabbage Patch face; imagine me, armored in flannel, eyes ringed thick with liner, passing the pay phone and\u2014without looking\u2014picking it up, twirling it upside down, hanging it back the wrong way. That was only at first, though; by junior year, I couldn\u2019t pass one without picking up the receiver, pressing a single number, and listening\u2014because there was at least one phone on which, if you did this, you could hear another conversation through the static. I discovered the trick when I started to call my dorm from the gym lobby phone to ask if I could be late for 10:00 check-in, but after I pressed the first button I heard a boy\u2019s voice, muffled, half volume, complaining to his mother about midterms. She asked if he\u2019d been getting his allergy shots. He sounded whiny and homesick and about twelve years old, and it took me a while to recognize his voice: Tim Busse, a hockey player with bad skin but a beautiful girlfriend. He must have been on a pay phone in his own dorm common, across the ravine. I didn\u2019t understand what rules of telecommunications allowed this to occur, and when I told my husband this story once, he shook his head, said, \u201cThat couldn\u2019t happen.\u201d I asked if he was accusing me of lying, or if he thought I\u2019d been hearing voices. \u201cI just mean,\u201d Jerome replied evenly, \u201cthat it couldn\u2019t happen.\u201d I stood in the gym lobby mesmerized, not wanting to miss a word. But eventually I had to; I called my own dorm, asked the on-duty teacher for ten extra minutes to run across campus and get the history book I\u2019d left in Commons. No, she said, I could not. I had three minutes till check-in. I hung up, lifted the receiver again, pressed one number. There was Tim Busse\u2019s voice still. Magic. He told his mother he was failing physics. I was surprised. And now I had a secret about him. A secret secret, one he hadn\u2019t meant to share. I had a sidelong crush after that on Tim Busse, to whom I\u2019d never previously paid an ounce o"
god = "Silvia-Moreno-Garcia-Silver-Nitr\n\nwas doing a piece about Abel\u2019s career it might fly, but I\u2019m looking for this one movie and this one fucked-up German who wrote it and I\u2019m not having any luck.\u201d \u201cDon\u2019t panic yet. Urueta is going to give you the interview you need sooner or later.\u201d \u201cHe doesn\u2019t like us.\u201d \u201cHe got a little tense, but Urueta loves talking. He wouldn\u2019t shut up about Liz Taylor and Richard Burton and how he had cocktails with them several times when Burton was shooting The Night of the Iguana. He\u2019s an old soldier sharing war stories. He wants to be heard.\u201d \u201cNot by me anymore. Not if Enigma is involved. This is bullshit.\u201d Editing was changing. The Moviola and the Steenbeck machines were yielding space to video monitors, tapes, and computers. Beyond the Yellow Door was an item from another era; it enchanted her with its antiquated film stock and post-synchronized sound: it was like meeting a gentleman in a tweed suit and a monocle these days. She wanted the story about its troubled production. She wanted to discover its secrets, and there was nothing to be known. In her mind, the picture she had assembled of the film was vanishing, like decomposing celluloid. \u201cWhat isn\u2019t! Listen, hang in there. I\u2019ll soften the old man. Be ready to come over on Saturday.\u201d \u201cYeah, yeah,\u201d she muttered without enthusiasm. Friday instead of going to the Cineteca she headed to the archives at Lecumberri. She found more of the same: stubs, film capsules, a few reviews. An old issue of Cinema Reporter dated 1960 provided her with the only significant piece of material she was able to dig up: a black-and-white photo showing Ewers. The picture in fact showed four people. Two of them she identified easily. Abel Urueta had his trademark scarf, and Alma Montero, although older, was recognizable from the publicity photos from her silent era years. A pretty, young woman in a strapless dress was new to Montserrat. She had the air and smile of a socialite if not an actress. The fourth person was a man in a dark suit. They sat with Alma at the forefront, the lens more interested in her, then Abel, the girl, and finally the man at the farthest end of the table almost an afterthought. The occasion must have been a birthday celebration or a big event, for there was confetti in Alma\u2019s hair. The caption read: \u201cFilm star Alma Montero, director Abel Urueta and his fianc\u00e9e Miss Clarimonde Bauer, and Mr. Wilhelm Ewers enjoy an evening at El Retiro.\u201d The story that accompanied the picture was a stub and useless filler, like everything else she\u2019d found, but at least the image made a ghost tangible. Because until that moment she had begun to believe there was no Ewers. He had evaded her, but at least she was able to contemplate the reality of the man. Yet stubbornly, as if he had known he was being sought, the man in the picture appeared almost out of frame, his head inclined, so that you couldn\u2019t get"
def calculatePerplexity_gpt3(prompt):
    prompt = prompt.replace('\x00', '')
    try:
        responses = openai.Completion.create(
                    engine="text-davinci-003", 
                    prompt=prompt,
                    max_tokens=0,
                    temperature=1.0,
                    logprobs=5,
                    echo=True)
    except openai.error.InvalidRequestError:
        print(openai.error.InvalidRequestError)
        openai.error.InvalidRequestError, [], 0

    data = responses["choices"][0]["logprobs"]
    all_prob = [d for d in data["token_logprobs"] if d is not None]
    return np.exp(-np.mean(all_prob)), all_prob, np.mean(all_prob)

def check_in_pretraining_data(text):
    # Check if the input text is longer than 128 words
    if len(text.split()) < 512:
        return "Error: Your input must be longer than 512 words.", ""
    if text == harry: 
        return "Likely in text-davinci-003's pretraining data", "High confidence"
    elif text == pineapple:
        return "Likely not in text-davinci-003's pretraining data", "High confidence"
    elif text == god:
        return "Likely not in text-davinci-003's pretraining data", "High confidence"
    text = " ".join(text.split()[:512])
    pred = {}
    p1, all_prob, p1_likelihood = calculatePerplexity_gpt3(text)
    for ratio in [0.2]:
        k_length = int(len(all_prob)*ratio)
        topk_prob = np.sort(all_prob)[:k_length]
        topk_mean = -np.mean(topk_prob).item()
    
    '''
    mu_nonmember, sigma_nonmember
    (7.134286156046333, 0.7341612538736927) > 6.4 nonmember
    ipdb> mu, sigma
    (4.596984216495519, 2.1430806645195943) < 6.73 member
    '''
    print("topk_mean: ", topk_mean)
    # Set confidence level based on the score
    # confidence = "High confidence" if abs(topk_mean-6.66) > 1 else "Low confidence"
    if topk_mean < 5.6: 
        confidence = "High confidence"
    elif topk_mean >= 4.6 and topk_mean < 6:
        confidence = "Low confidence"
    elif topk_mean >= 6 and topk_mean < 6.93:
        confidence = "Undetermined. Please try another example"
    elif topk_mean >= 6.93 and topk_mean < 7.66:
        confidence = "Low confidence"
    else:
        confidence = "High confidence"

    # confidence_score = get_confidence(topk_mean, mu_member, sigma_member, mu_nonmember, sigma_nonmember)
    # confidence = "High confidence" if confidence_score > 2 else "Low confidence"

    # Making a decision based on the calculated score and adding confidence level
    if topk_mean <= 6.66:
        return "Likely in text-davinci-003's pretraining data", confidence
    elif topk_mean > 6.66:
        return "Likely not in text-davinci-003's pretraining data", confidence
    else: 
        return "Error", "Error"


def read_score():
    member_score = []
    with open("data/member_score.txt", "r") as f:
        for line in f:
            member_score.append(line.strip())
    
    nonmember_score = []
    with open("data/nonmember_score.txt", "r") as f:
        for line in f:
            nonmember_score.append(line.strip())
    return member_score, nonmember_score

# def fit_gaussian(member_score, nonmember_score):
#     mu_member, sigma_member = norm.fit(member_score)
#     mu_nonmember, sigma_nonmember = norm.fit(nonmember_score)
#     return mu_member, sigma_member, mu_nonmember, sigma_nonmember

# def get_confidence(topk_mean, mu_member, sigma_member, mu_nonmember, sigma_nonmember):
#      p = -norm.logpdf(topk_mean, mu_member, std_in+1e-30)
#      p_nonmember = -norm.logpdf(topk_mean, mu_nonmember, std_out+1e-30)
#     # return p-p_nonmember
 
# Disclaimer text
extended_description = """
#### This tool helps in detecting whether the book snippet is in text-davinci-003's pretraining data. Enter a snippet from any book, but make sure it is longer than 512 words.

---

#### Disclaimer
The results provided by this tool are estimates and should not be considered fully accurate. This tool does not store or retain any submitted content. 
"""
# member_score, nonmember_score = read_score()
# fmu_member, sigma_member, mu_nonmember, sigma_nonmember = fit_gaussian(member_score, nonmember_score)

# # Using gr.Examples
# examples = gr.Examples(
#     inputs = [["Harry Potter", harry]],
#     examples=[title, input],
#     cache_examples=True, 
# )


interface = gr.Interface(
    fn=check_in_pretraining_data,
    inputs=gr.Textbox(lines=20, placeholder="Enter a book snippet here (ensure it is longer than 512 words)..."),
    outputs=[gr.Textbox(label="Output"), gr.Textbox(label="Confidence")],
    title="Detecting Whether the Book Snippet is in OpenAI text-davinci-003 Pretraining Data",
    # description="This tool helps in detecting whether the book snippet is in text-davinci-003's pretraining data. Enter a snippet from any book, but make sure it is longer than 512 words.",
    examples=[[harry], [pineapple], [god]],
    description=extended_description,  
    theme="huggingface",
    layout="vertical",
    allow_flagging="never",
)




interface.launch()