Data Preprocessing (30 points)
Preprocess the data you are given to your liking. This may include dropping some columns you won't use,
addressing noisy or missing data, etc.
Use Pandas as a dataframe abstraction for this task. You can learn about Pandas here:
[Embedded video: "Pandas Tutorial 2021", a YouTube tutorial covering Pandas in one video]
Ultimately, this task has to result in a dataframe that you will use for training and testing the classifier.
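As a minimal sketch of what this might look like (the file name, column names, and cleaning choices below are hypothetical placeholders, not requirements):

import pandas as pd

# load the raw data (file and column names are made up for illustration)
df = pd.read_csv("users.csv")

# drop columns we won't use
df = df.drop(columns=["user_id"])

# address missing data: fill missing salaries with the median salary,
# then drop any rows that are still incomplete
df["salary"] = df["salary"].fillna(df["salary"].median())
df = df.dropna()

# address obviously noisy rows, e.g. non-positive salaries
df = df[df["salary"] > 0]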
Logistic Regression (40 points)
Write, from scratch, a logistic regression solution to the prediction problem that works with
Stochastic Gradient Descent (SGD). You may only use pandas and numpy to do so.
Show clearly all equations of the gradient and include comments, in either markdown or Python (inline with the
code), explaining every stage of processing. Also, highlight any enhancements you may have made to
improve performance, such as regularization.
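For example, the core SGD update with L2 regularization might be sketched like this (the learning rate, regularization strength, and the decision to regularize the bias term are all choices you should make and justify yourself):

import numpy as np

def fit_logistic_sgd(X, y, lr=0.01, lam=0.01, epochs=100, seed=0):
    """X: (n, d) array whose first column is all 1s; y: (n,) array of 0/1 labels."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):            # visit points in random order
            p = 1.0 / (1.0 + np.exp(-X[i] @ beta))   # predicted probability for point i
            # gradient of the regularized log likelihood at a single point:
            #   (y_i - p_i) * x_i  -  lam * beta
            grad = (y[i] - p) * X[i] - lam * beta
            beta += lr * grad                         # ascend, since we maximize likelihood
    return beta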
Performance Results (20 points)
Plot the precision vs recall curve of your classifier. Clearly explain the tradeoff between the two quantities
and the shape of the curve.
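One way to produce such a plot is to sweep the decision threshold over the predicted probabilities and compute precision and recall at each threshold; a sketch (assuming you already have arrays y_true of labels and y_scores of predicted probabilities):

import numpy as np
import matplotlib.pyplot as plt

def pr_curve(y_true, y_scores):
    """precision and recall at every distinct predicted probability used as a threshold"""
    y_true, y_scores = np.asarray(y_true), np.asarray(y_scores)
    precisions, recalls = [], []
    for t in np.sort(np.unique(y_scores)):
        pred = (y_scores >= t).astype(int)
        tp = np.sum((pred == 1) & (y_true == 1))
        fp = np.sum((pred == 1) & (y_true == 0))
        fn = np.sum((pred == 0) & (y_true == 1))
        precisions.append(tp / (tp + fp) if tp + fp > 0 else 1.0)
        recalls.append(tp / (tp + fn) if tp + fn > 0 else 0.0)
    return recalls, precisions

# recalls, precisions = pr_curve(y_test, predicted_probabilities)
# plt.plot(recalls, precisions)
# plt.xlabel("recall"); plt.ylabel("precision"); plt.show()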
Chapter 16. Logistic Regression
A lot of people say there’s a fine line between genius and insanity. I don’t think there’s a
fine line, I actually think there’s a yawning gulf.
Bill Bailey
In Chapter 1, we briefly looked at the problem of trying to predict which DataSciencester
users paid for premium accounts. Here we'll revisit that problem.
The Problem
We have an anonymized data set of about 200 users, containing each user's salary, her
years of experience as a data scientist, and whether she paid for a premium account
(Figure 16-1). As is usual with categorical variables, we represent the dependent variable
as either 0 (no premium account) or 1 (premium account).
As usual, our data is in a matrix where each row is a list [experience, salary,
paid_account]. Let’s turn it into the format we need:
x = [[1] + row[:2] for row in data] # each element is [1, experience, salary]
y = [row[2] for row in data] # each element is paid_account
An obvious first attempt is to use linear regression and find the best model:

$$\text{paid account} = \beta_0 + \beta_1 \cdot \text{experience} + \beta_2 \cdot \text{salary} + \varepsilon$$
Figure 16-1. Paid and unpaid users
And certainly, there's nothing preventing us from modeling the problem this way. The
results are shown in Figure 16-2:

rescaled_x = rescale(x)
beta = estimate_beta(rescaled_x, y)   # [0.26, 0.43, -0.43]
predictions = [predict(x_i, beta) for x_i in rescaled_x]

plt.scatter(predictions, y)
plt.xlabel("predicted")
plt.ylabel("actual")
plt.show()
Figure 16-2. Using linear regression to predict premium accounts
But this approach leads to a couple of immediate problems:
We'd like for our predicted outputs to be 0 or 1, to indicate class membership. It's fine
if they're between 0 and 1, since we can interpret these as probabilities: an output of
0.25 could mean a 25% chance of being a paid member. But the outputs of the linear
model can be huge positive numbers or even negative numbers, which it's not clear
how to interpret. Indeed, here a lot of our predictions were negative.
The linear regression model assumed that the errors were uncorrelated with the
columns of x. But here, the regression coefficient for experience is 0.43, indicating that
more experience leads to a greater likelihood of a premium account. This means that
our model outputs very large values for people with lots of experience. But we know
that the actual values must be at most 1, which means that necessarily very large
outputs (and therefore very large values of experience) correspond to very large
negative values of the error term. Because this is the case, our estimate of beta is
biased.
What we'd like instead is for large positive values of dot(x_i, beta) to correspond to
probabilities close to 1, and for large negative values to correspond to probabilities close
to 0. We can accomplish this by applying another function to the result.
The Logistic Function
In the case of logistic regression, we use the logistic function, pictured in Figure 16-3:
def logistic(x):
    return 1.0 / (1 + math.exp(-x))
Figure 16-3. The logistic function
As its input gets large and positive, it gets closer and closer to 1. As its input gets large
and negative, it gets closer and closer to 0. Additionally, it has the convenient property
that its derivative is given by:
def logistic_prime(x):
    return logistic(x) * (1 - logistic(x))
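For reference, this identity follows by differentiating directly:

$$\frac{d}{dx}\,\frac{1}{1 + e^{-x}} = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \mathrm{logistic}(x)\,\bigl(1 - \mathrm{logistic}(x)\bigr)$$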
which we'll make use of in a bit. We'll use this to fit a model:

$$y_i = f(x_i \beta) + \varepsilon_i$$

where f is the logistic function.
Recall that for linear regression we fit the model by minimizing the sum of squared errors,
which ended up choosing the beta that maximized the likelihood of the data.
Here the two aren’t equivalent, so we’ll use gradient descent to maximize the likelihood
directly. This means we need to calculate the likelihood function and its gradient.
Given some beta, our model says that each y_i should equal 1 with probability f(x_i beta) and 0
with probability 1 - f(x_i beta).

In particular, the pdf for y_i can be written as:

$$p(y_i \mid x_i, \beta) = f(x_i \beta)^{y_i}\,\bigl(1 - f(x_i \beta)\bigr)^{1 - y_i}$$

since if y_i is 0, this equals:

$$1 - f(x_i \beta)$$

and if y_i is 1, it equals:

$$f(x_i \beta)$$
It turns out that it's actually simpler to maximize the log likelihood:

$$\log L(\beta \mid x_i, y_i) = y_i \log f(x_i \beta) + (1 - y_i) \log\bigl(1 - f(x_i \beta)\bigr)$$

Because log is a strictly increasing function, any beta that maximizes the log likelihood also
maximizes the likelihood, and vice versa.
def logistic_log_likelihood_i(x_i, y_i, beta):
    if y_i == 1:
        return math.log(logistic(dot(x_i, beta)))
    else:
        return math.log(1 - logistic(dot(x_i, beta)))
If we assume different data points are independent from one another, the overall likelihood
is just the product of the individual likelihoods. Which means the overall log likelihood is
the sum of the individual log likelihoods:
def logistic_log_likelihood(x, y, beta):
    return sum(logistic_log_likelihood_i(x_i, y_i, beta)
               for x_i, y_i in zip(x, y))
A little bit of calculus gives us the gradient:
def logistic_log_partial_ij(x_i, y_i, beta, j):
    """here i is the index of the data point,
    j the index of the derivative"""
    return (y_i - logistic(dot(x_i, beta))) * x_i[j]

def logistic_log_gradient_i(x_i, y_i, beta):
    """the gradient of the log likelihood
    corresponding to the ith data point"""
    return [logistic_log_partial_ij(x_i, y_i, beta, j)
            for j, _ in enumerate(beta)]

def logistic_log_gradient(x, y, beta):
    return reduce(vector_add,
                  [logistic_log_gradient_i(x_i, y_i, beta)
                   for x_i, y_i in zip(x, y)])
at which point we have all the pieces we need.
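For the record, the calculus behind logistic_log_partial_ij: differentiating the single-point log likelihood with respect to the jth coefficient, and using the derivative property of the logistic function noted earlier, gives

$$\frac{\partial}{\partial \beta_j}\Bigl[\,y_i \log f(x_i \beta) + (1 - y_i) \log\bigl(1 - f(x_i \beta)\bigr)\Bigr] = \bigl(y_i - f(x_i \beta)\bigr)\,x_{ij}$$

which is exactly the quantity returned above.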
Applying the Model
We’ll want to split our data into a training set and a test set:
random.seed(0)
x_train, x_test, y_train, y_test = train_test_split(rescaled_x, y, 0.33)

# want to maximize log likelihood on the training data
fn = partial(logistic_log_likelihood, x_train, y_train)
gradient_fn = partial(logistic_log_gradient, x_train, y_train)

# pick a random starting point
beta_0 = [random.random() for _ in range(3)]

# and maximize using gradient descent
beta_hat = maximize_batch(fn, gradient_fn, beta_0)
Alternatively, you could use stochastic gradient descent:
beta_hat = maximize_stochastic(logistic_log_likelihood_i,
                               logistic_log_gradient_i,
                               x_train, y_train, beta_0)
Either way we find approximately:
beta_hat = [-1.90, 4.05, -3.87]
These are coefficients for the rescaled data, but we can transform them back to the
original data as well:

beta_hat_unscaled = [7.61, 1.42, -0.000249]
Unfortunately, these are not as easy to interpret as linear regression coefficients. All else
being equal, an extra year of experience adds 1.42 to the input of logistic. All else being
equal, an extra $10,000 of salary subtracts 2.49 from the input of logistic.
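(If you want to reproduce that unscaling yourself, and assuming rescale standardized each non-constant column to mean 0 and standard deviation 1, the back-transformation looks roughly like the sketch below; the helper name is hypothetical.)

def unscale_beta(beta, means, stdevs):
    """convert coefficients fit on standardized columns back to original units;
    means[j] and stdevs[j] describe original column j, and column 0 is the
    untouched constant-1s column"""
    unscaled = [beta[j] / stdevs[j] for j in range(1, len(beta))]
    intercept = beta[0] - sum(beta[j] * means[j] / stdevs[j]
                              for j in range(1, len(beta)))
    return [intercept] + unscaled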
The impact on the output, however, depends on the other inputs as well. If dot(beta,
x_i) is already large (corresponding to a probability close to 1), increasing it even by a lot
cannot affect the probability very much. If it's close to 0, increasing it just a little might
increase the probability quite a bit.
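For instance, using the logistic function defined above:

logistic(1) - logistic(0)   # about 0.23: near zero, a unit increase in the input moves the probability a lot
logistic(6) - logistic(5)   # about 0.004: near saturation, the same increase barely moves it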
What we can say is that — all else being equal — people with more experience are more
likely to pay for accounts. And that — all else being equal — people with higher salaries
are less likely to pay for accounts. (This was also somewhat apparent when we plotted the
data.)
Goodness of Fit
We haven’t yet used the test data that we held out. Let’s see what happens if we predict
paid account whenever the probability exceeds 0.5:
true_positives = false_positives = true_negatives = false_negatives = 0

for x_i, y_i in zip(x_test, y_test):
    predict = logistic(dot(beta_hat, x_i))

    if y_i == 1 and predict >= 0.5:   # TP: paid and we predict paid
        true_positives += 1
    elif y_i == 1:                    # FN: paid and we predict unpaid
        false_negatives += 1
    elif predict >= 0.5:              # FP: unpaid and we predict paid
        false_positives += 1
    else:                             # TN: unpaid and we predict unpaid
        true_negatives += 1

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
This gives a precision of 93% (“when we predict paid account we’re right 93% of the
time”) and a recall of 82% (“when a user has a paid account we predict paid account 82%
of the time”), both of which are pretty respectable numbers.
We can also plot the predictions versus the actuals (Figure 16-4), which also shows that
the model performs well:
predictions = [logistic(dot(beta_hat, x_i)) for x_i in x_test]
plt.scatter(predictions, y_test)
plt.xlabel("predicted probability")
plt.ylabel("actual outcome")
plt.title("Logistic Regression Predicted vs. Actual")
plt.show()
Figure 16-4. Logistic regression predicted versus actual
Support Vector Machines
The set of points where dot(beta_hat, x_i) equals 0 is the boundary between our
classes. We can plot this to see exactly what our model is doing (Figure 16-5).

This boundary is a hyperplane that splits the input space into two half-spaces
corresponding to predict paid and predict unpaid. We found it as a side-effect of finding
the most likely logistic model.
An alternative approach to classification is to just look for the hyperplane that "best"
separates the classes in the training data. This is the idea behind the support vector
machine, which finds the hyperplane that maximizes the distance to the nearest point in
each class (Figure 16-6).
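For reference, though this chapter does not develop it further, the standard hard-margin form of that problem, with labels coded as y_i in {-1, +1}, is

$$\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \ \text{for all } i$$

so that the margin being maximized is 2 / ||w||.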
Figure 16-5. Paid and unpaid users with decision boundary
Finding such a hyperplane is an optimization problem that involves techniques that are too
advanced for us. A different problem is that a separating hyperplane might not exist at all.
In our “who pays?” data set there simply is no line that perfectly separates the paid users
from the unpaid users.
We can (sometimes) get around this by transforming the data into a higher-dimensional
space. For example, consider the simple one-dimensional data set shown in Figure 16-7.
Figure 16-6. A separating hyperplane
It’s clear that there’s no hyperplane that separates the positive examples from the negative
ones. However, look at what happens when we map this data set to two dimensions by
sending the point x to (x, x**2). Suddenly it’s possible to find a hyperplane that splits the
data (Figure 16-8).
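A tiny made-up illustration of that mapping (the data points here are invented purely for demonstration):

points = [-2.0, -0.5, 0.5, 2.0]
labels = [1, 0, 0, 1]                  # positives on the outside, negatives in the middle

mapped = [(x, x ** 2) for x in points]
# [(-2.0, 4.0), (-0.5, 0.25), (0.5, 0.25), (2.0, 4.0)]
# no single threshold on x separates the classes, but in two dimensions the
# horizontal line x**2 = 1 does: both positives lie above it, both negatives below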
This is usually called the kernel trick because rather than actually mapping the points into
the higher-dimensional space (which could be expensive if there are a lot of points and the
mapping is complicated), we can use a “kernel” function to compute dot products in the
higher-dimensional space and use those to find a hyperplane.
Figure 16-7. A nonseparable one-dimensional data set
It’s hard (and probably not a good idea) to use support vector machines without relying on
specialized optimization software written by people with the appropriate expertise, so
we’ll have to leave our treatment here.
Figure 16-8. Data set becomes separable in higher dimensions
For Further Investigation
scikit-learn has modules for both Logistic Regression and Support Vector Machines.
libsvm is the support vector machine implementation that scikit-learn is using behind
the scenes. Its website has a variety of useful documentation about support vector
machines.
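For instance, a minimal sketch of using both on the same training data as above (results will differ somewhat from the from-scratch model, since scikit-learn's LogisticRegression applies L2 regularization by default):

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

logit = LogisticRegression().fit(x_train, y_train)
svm = SVC(kernel="rbf").fit(x_train, y_train)

print(logit.score(x_test, y_test))   # accuracy on the held-out test set
print(svm.score(x_test, y_test))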