From amazon dude, 3 Months ago, written in Plain Text.
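The phone interview question was as follows: tf-idf
I solved it correctly within the time limit but still did not move on; maybe my experience was not a good match.
- Newcomer here, asking for rice (forum points)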
# The TF-IDF equation is as follows:
# TF (Term Frequency) = (Number of occurrences of a term in a document) / (Total number of terms in the document)
# IDF (Inverse Document Frequency) = log((Total number of documents) / (Number of documents containing the term))
# TF-IDF = TF * IDF
# Write a function that calculates the tf-idf of the top K words in each document and prints them in the following format
# 0, word1, tfidf
# 0, word2, tfidf
# 0, word3, tfidf
# 1, word1, tfidf
# 1, word2, tfidf
# 1, word3, tfidf
# 2, word1, tfidf
# 2, word2, tfidf
# 2, word3, tfidf
import math
import re
from collections import defaultdict
articles = [
    "stretches of history terminating an unwanted pregnancy especially in the early stages was a relatively uncontroversial fact of life historians say Egyptian papyrus Greek plays Roman coins the medieval biographies of saints medical and midwifery manuals and Victorian newspaper and pamphlets reveal that abortion was more common in premodern times than people might think",
    "When asked whether any remains may be recovered Mauger noted the incredibly unforgiving environment adding I dont have an answer for prospects at this time A medical expert said a deep-sea implosion would leave behind no recoverable remains Once the search began crews had sonar buoys in the water nearly continuously and did not detect any catastrophic events Mauger said",
    "His 1989 film The Abyss was set in the ocean and Cameron has said he made Titanic in part to explore the wreckI made Titanic because I wanted to dive to the shipwreck not because I particularly wanted to make the movie he told Playboy in 2009 The Titanic was the Mount Everest of shipwrecks and as a diver I wanted to do it right When I learned some other guys had dived to the Titanic to make an IMAX movie I said Ill make a Hollywood movie to pay for an expedition and do the same thing I loved that first taste and I wanted more",
    "The small caps rally is also an auspicious sign for the broader economy says Quincy Krosby chief global strategist at LPL Financial Because small caps tend to be more volatile their rally suggests that investors risk appetites are growing and theyre looking past the banking turmoil earlier this yearIf we continue to see interest in the small caps it would reflect investor belief that it will be a more muted recession said Krosby",
    "During a signing ceremony in Paris on Thursday Airbus CEO Guillaume Faury and the NDRCs head Zheng Shanjie pledged to accelerate the construction of Airbus new assembly line in the Chinese coastal city of Tianjin the NDRC said The Chinese planner said it supports domestic airlines cooperating with Airbus according to their needs",
    "The suspended production comes after employees voted down Spirit AeroSystems best and final offer and then authorized a strike according to the union The work stoppage is set to begin on Saturday",
    "The International Association of Machinists and Aerospace Workers or IAM represents about 6000 workers at the plant The contract was rejected by 79% of members and 85% voted to strike the union said",
    "We are providing you with this update in real-time to let you know the Companys intention to move forward with this sale Dixon and Lokhandwala wrote in the memo It still hasnt been finalized by the court but once it is it will mark an important milestone on the road to long-term financial health and stability for VMG",
    "Under new ownership Dixon and Lokhandwala added we look forward to a new chapter in VMGs history with a renewed focus and commitment to creating world-class content for our audiences and partners"]
def get_tf_idf(articles, k):
    """Print "doc_index, word, tf-idf" for the k most frequent words of each article."""
    num_of_articles = len(articles)
    top_k_words_list = []
    # Maps each word to the set of article indices that contain it.
    word_freq_in_docs = defaultdict(set)
    for i, article in enumerate(articles):
        # Split on runs of whitespace.
        word_arr = re.split(r'\s+', article.strip())
        freq_dict = defaultdict(int)
        for word in word_arr:
            freq_dict[word] += 1
            word_freq_in_docs[word].add(i)
        total_number = len(word_arr)
        top_k_words = []
        for word, cnt in freq_dict.items():
            top_k_words.append((word, cnt, total_number))
        # Most frequent words first; "top K" is taken here to mean top K by raw count.
        top_k_words.sort(key=lambda x: (-x[1]))
        top_k_words_list.append(top_k_words)
    # print(top_k_words_list)
    # print(word_freq_in_docs)
    for j in range(num_of_articles):
        curr_top_k_words = top_k_words_list[j]
        # Guard against articles with fewer than k distinct words.
        for m in range(min(k, len(curr_top_k_words))):
            word, cnt, total_number = curr_top_k_words[m]
            tf = cnt / total_number
            idf = math.log(num_of_articles / len(word_freq_in_docs[word]))
            # result.append([str(j), word, str(tf*idf)])
            print(", ".join([str(j), word, str(tf * idf)]))


get_tf_idf(articles, 3)
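
The prompt's "top K words" can be read two ways. get_tf_idf picks each article's K most frequent words and then scores them. The other reading is to rank by the TF-IDF score itself, and a follow-up from the interviewer might go that way. Below is a minimal sketch of that second reading, not the expected answer. The name top_k_by_tfidf, the toy corpus sample_docs, and the alphabetical tie-break are all assumptions made for illustration. It uses the same formulas as the comment header, with the natural log from math.log.

```python
import math
import re
from collections import Counter


def top_k_by_tfidf(docs, k):
    """Return, per document, up to k (word, score) pairs ranked by TF-IDF."""
    # Whitespace tokenization only: no lowercasing or punctuation stripping (an assumption).
    tokenized = [re.split(r"\s+", doc.strip()) for doc in docs]

    # Document frequency: how many documents contain each word at least once.
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    n_docs = len(docs)
    results = []
    for words in tokenized:
        counts = Counter(words)
        total = len(words)
        scores = {
            word: (cnt / total) * math.log(n_docs / doc_freq[word])
            for word, cnt in counts.items()
        }
        # Highest score first; ties broken alphabetically so the output is deterministic.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        results.append(ranked[:k])
    return results


# Hypothetical toy corpus, small enough to check by hand.
sample_docs = ["a a b", "a c"]
for doc_id, ranked in enumerate(top_k_by_tfidf(sample_docs, 2)):
    for word, score in ranked:
        print(f"{doc_id}, {word}, {score:.4f}")
# 0, b, 0.2310
# 0, a, 0.0000
# 1, c, 0.3466
# 1, a, 0.0000
```

In the toy corpus, "a" is the most frequent word in the first document but appears in every document, so its IDF is log(2/2) = 0 and it drops below "b". That is the practical difference between the two readings: ranking by raw count favors common words, while ranking by score favors words distinctive to a document.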