tek4

Apache Spark – Machine Learning with PySpark and MLlib

September 26, 2021
Knowledge
<blockquote> <p style="text-align: justify;">Welcome back to the&nbsp;<a href="../../tu-hoc-tensorflow-deep-learning-cho-nguoi-moi-bat-dau/" target="_blank" rel="noopener">Tự Học Tensorflow (Self-Study TensorFlow)</a>&nbsp;series from&nbsp;<a href="../../" target="_blank" rel="noopener">tek4.vn</a>. This article introduces what Apache Spark is, how it works, and how to get it running. Let's get started.</p> </blockquote> <p style="text-align: justify;">See the previous article:&nbsp;<a href="../../recurrent-neural-network-tu-hoc-tensorflow/#ftoc-heading-11" target="_blank" rel="noopener">Recurrent Neural Network &ndash;&nbsp;Example with TensorFlow</a></p> <p style="text-align: justify;"><img class="aligncenter size-full wp-image-7376 disappear appear" src="../../wp-content/uploads/2021/01/17.png" sizes="(max-width: 1111px) 100vw, 1111px" srcset="https://old.tek4.vn/wp-content/uploads/2021/01/17.png 1111w, https://old.tek4.vn/wp-content/uploads/2021/01/17-300x157.png 300w, https://old.tek4.vn/wp-content/uploads/2021/01/17-1024x535.png 1024w, https://old.tek4.vn/wp-content/uploads/2021/01/17-768x401.png 768w" alt="Apache Spark " width="1111" height="580" loading="lazy" /></p> <h3 id="ftoc-heading-1" class="ftwp-heading" style="text-align: justify;">What is Apache Spark?</h3> <p style="text-align: justify;">Spark is a big data solution that has proven to be easier and faster than Hadoop MapReduce. Spark is open-source software developed by the UC Berkeley RAD Lab in 2009. 
Since its public release in 2010, Spark has grown in popularity and is used across the industry at an unprecedented scale.</p> <p style="text-align: justify;">In the era of big data, practitioners need fast and reliable tools more than ever to process streams of data. Earlier tools such as MapReduce were popular but slow. To overcome this, Spark offers a solution that is both fast and general-purpose. The main difference between Spark and MapReduce is that Spark runs computations in memory, whereas MapReduce writes intermediate results to disk. This allows high-speed data access and processing, reducing run times from hours to minutes.</p> <p style="text-align: justify;"><strong>What is PySpark?</strong></p> <p style="text-align: justify;">Spark is the name of the engine that performs cluster computing, while PySpark is the Python library for using Spark.</p> <h3 id="ftoc-heading-2" class="ftwp-heading" style="text-align: justify;">How does Spark work?</h3> <p style="text-align: justify;">Spark is built around a computational engine, meaning it takes care of scheduling, distributing, and monitoring applications. Each job is carried out across several worker machines, which together form a computing cluster. A computing cluster divides the work into tasks: one machine performs one task, while the others contribute to the final result through different tasks. 
In the end, the results of all tasks are aggregated to produce a single output.</p> <p style="text-align: justify;"><img class="aligncenter size-full wp-image-7388 disappear appear" src="../../wp-content/uploads/2021/01/082918_1213_ApacheSpark1.png" sizes="(max-width: 1536px) 100vw, 1536px" srcset="https://old.tek4.vn/wp-content/uploads/2021/01/082918_1213_ApacheSpark1.png 1536w, https://old.tek4.vn/wp-content/uploads/2021/01/082918_1213_ApacheSpark1-300x113.png 300w, https://old.tek4.vn/wp-content/uploads/2021/01/082918_1213_ApacheSpark1-1024x387.png 1024w, https://old.tek4.vn/wp-content/uploads/2021/01/082918_1213_ApacheSpark1-768x290.png 768w" alt="spark" width="1536" height="580" loading="lazy" /></p> <p style="text-align: justify;">Spark is designed to work with:</p> <ul style="text-align: justify;"> <li>Python</li> <li>Java</li> <li>Scala</li> <li>SQL</li> </ul> <p style="text-align: justify;">A significant feature of Spark is its large set of built-in libraries, including MLlib for machine learning. It is also designed to work with Hadoop clusters and can read a broad range of data sources, including Hive data, CSV, JSON, and Cassandra data.</p> <h3 id="ftoc-heading-3" class="ftwp-heading" style="text-align: justify;"><span id="how_to_install_pyspark_on_aws">How to install PySpark on AWS</span></h3> <p style="text-align: justify;">The Jupyter team builds a Docker image for running Spark efficiently. Below are the steps you can follow to set up a PySpark instance on AWS.</p> <h4 id="ftoc-heading-4" class="ftwp-heading" style="text-align: justify;"><strong>Step 1: Create an instance</strong></h4> <p style="text-align: justify;">First, you need to create an instance. Go to your AWS account and launch the instance. 
You can increase the storage to 15 GB and use the same security group as in the TensorFlow tutorial.</p> <h4 id="ftoc-heading-5" class="ftwp-heading" style="text-align: justify;"><strong>Step 2: Open the connection</strong></h4> <p style="text-align: justify;">Open the connection and install Docker. Note that you need to be in the correct working directory.</p> <p style="text-align: justify;">Simply run these commands to install Docker:</p> <div id="urvanov-syntax-highlighter-610ff0b7336f3792663404" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7336f3792663404-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7336f3792663404-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7336f3792663404-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7336f3792663404-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7336f3792663404-5">5</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b7336f3792663404-1" class="crayon-line"><span class="crayon-e">sudo </span><span class="crayon-e">yum </span><span class="crayon-v">update</span> <span class="crayon-o">-</span><span class="crayon-i">y</span></div> <div 
id="urvanov-syntax-highlighter-610ff0b7336f3792663404-2" class="crayon-line crayon-striped-line"><span class="crayon-e">sudo </span><span class="crayon-e">yum </span><span class="crayon-v">install</span> <span class="crayon-o">-</span><span class="crayon-i">y</span> <span class="crayon-e">docker</span></div> <div id="urvanov-syntax-highlighter-610ff0b7336f3792663404-3" class="crayon-line"><span class="crayon-e">sudo </span><span class="crayon-e">service </span><span class="crayon-e">docker </span><span class="crayon-e">start</span></div> <div id="urvanov-syntax-highlighter-610ff0b7336f3792663404-4" class="crayon-line crayon-striped-line"><span class="crayon-e">sudo </span><span class="crayon-v">usermod</span> <span class="crayon-o">-</span><span class="crayon-v">a</span> <span class="crayon-o">-</span><span class="crayon-i">G</span> <span class="crayon-e">docker </span><span class="crayon-v">ec2</span><span class="crayon-o">-</span><span class="crayon-e">user</span></div> <div id="urvanov-syntax-highlighter-610ff0b7336f3792663404-5" class="crayon-line"><span class="crayon-v">exit</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h4 id="ftoc-heading-6" class="ftwp-heading" style="text-align: justify;"><strong>Step 3: Reopen the connection and install Spark</strong></h4> <p style="text-align: justify;">After reopening the connection, you can run the Docker image containing PySpark.</p> <div id="urvanov-syntax-highlighter-610ff0b733700886365878" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " 
data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733700886365878-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733700886365878-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733700886365878-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733700886365878-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733700886365878-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733700886365878-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733700886365878-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733700886365878-8">8</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733700886365878-1" class="crayon-line"><span class="crayon-p">## Spark</span></div> <div id="urvanov-syntax-highlighter-610ff0b733700886365878-2" class="crayon-line crayon-striped-line"><span class="crayon-e">docker </span><span class="crayon-v">run</span> <span class="crayon-o">-</span><span class="crayon-v">v</span> <span class="crayon-o">~</span><span class="crayon-o">/</span><span class="crayon-v">work</span><span class="crayon-o">:</span><span class="crayon-o">/</span><span class="crayon-v">home</span><span class="crayon-o">/</span><span class="crayon-v">jovyan</span><span class="crayon-o">/</span><span class="crayon-v">work</span> <span class="crayon-o">-</span><span class="crayon-v">d</span> <span class="crayon-o">-</span><span class="crayon-i">p</span> <span class="crayon-cn">8888</span><span class="crayon-o">:</span><span class="crayon-cn">8888</span> <span class="crayon-v">jupyter</span><span class="crayon-o">/</span><span 
class="crayon-v">pyspark</span><span class="crayon-o">-</span><span class="crayon-i">notebook</span></div> <div id="urvanov-syntax-highlighter-610ff0b733700886365878-3" class="crayon-line"></div> <div id="urvanov-syntax-highlighter-610ff0b733700886365878-4" class="crayon-line crayon-striped-line"><span class="crayon-p">## Allow preserving Jupyter notebook</span></div> <div id="urvanov-syntax-highlighter-610ff0b733700886365878-5" class="crayon-line"><span class="crayon-e">sudo </span><span class="crayon-i">chown</span> <span class="crayon-cn">1000</span> <span class="crayon-o">~</span><span class="crayon-o">/</span><span class="crayon-i">work</span></div> <div id="urvanov-syntax-highlighter-610ff0b733700886365878-6" class="crayon-line crayon-striped-line"></div> <div id="urvanov-syntax-highlighter-610ff0b733700886365878-7" class="crayon-line"><span class="crayon-p">## Install tree to see our working directory next</span></div> <div id="urvanov-syntax-highlighter-610ff0b733700886365878-8" class="crayon-line crayon-striped-line"><span class="crayon-e">sudo </span><span class="crayon-e">yum </span><span class="crayon-v">install</span> <span class="crayon-o">-</span><span class="crayon-i">y</span> <span class="crayon-v">tree</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h4 id="ftoc-heading-7" class="ftwp-heading" style="text-align: justify;"><strong>Step 4: Open Jupyter</strong></h4> <p style="text-align: justify;">Check the running container and its name:</p> <div id="urvanov-syntax-highlighter-610ff0b733704461102807" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td 
class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733704461102807-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733704461102807-1" class="crayon-line"><span class="crayon-e">docker </span><span class="crayon-v">ps</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">Display the container's logs by running docker logs followed by the container name. For example: docker logs zealous_goldwasser.</p> <p style="text-align: justify;">Go to your browser and launch Jupyter. The address is http://localhost:8888/. Paste the password given by the terminal.</p> <p style="text-align: justify;">Note: if you want to upload files to or download files from your AWS machine, you can use&nbsp;<a href="https://cyberduck.io/" target="_blank" rel="noopener">Cyberduck</a>.</p> <h3 id="ftoc-heading-8" class="ftwp-heading" style="text-align: justify;"><span id="installing_pyspark_on_windows_and_mac_with_anaconda">How to install PySpark on Windows / Mac with Conda</span></h3> <p style="text-align: justify;">The following is the detailed procedure for installing PySpark on Windows / Mac using Anaconda:</p> <p style="text-align: justify;">To install Spark on your local machine, a recommended practice is to create a new conda environment.</p> <p style="text-align: justify;">This new environment will install Python 3.6, Spark, and all the dependencies.</p> <p style="text-align: justify;"><strong>Mac users</strong></p> <div id="urvanov-syntax-highlighter-610ff0b733708769009318" class="urvanov-syntax-highlighter-syntax crayon-theme-classic 
urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733708769009318-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733708769009318-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733708769009318-3">3</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733708769009318-1" class="crayon-line"><span class="crayon-e">cd </span><span class="crayon-e">anaconda3</span></div> <div id="urvanov-syntax-highlighter-610ff0b733708769009318-2" class="crayon-line crayon-striped-line"><span class="crayon-e">touch </span><span class="crayon-v">hello</span><span class="crayon-o">-</span><span class="crayon-v">spark</span><span class="crayon-sy">.</span><span class="crayon-e">yml</span></div> <div id="urvanov-syntax-highlighter-610ff0b733708769009318-3" class="crayon-line"><span class="crayon-e">vi </span><span class="crayon-v">hello</span><span class="crayon-o">-</span><span class="crayon-v">spark</span><span class="crayon-sy">.</span><span class="crayon-v">yml</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;"><strong>Windows users</strong></p> <div id="urvanov-syntax-highlighter-610ff0b73370b441529691" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" 
data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73370b441529691-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73370b441529691-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73370b441529691-3">3</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73370b441529691-1" class="crayon-line"><span class="crayon-i">cd</span> <span class="crayon-v">C</span><span class="crayon-o">:</span><span class="crayon-sy">\</span><span class="crayon-v">Users</span><span class="crayon-sy">\</span><span class="crayon-v">Admin</span><span class="crayon-sy">\</span><span class="crayon-e">Anaconda3</span></div> <div id="urvanov-syntax-highlighter-610ff0b73370b441529691-2" class="crayon-line crayon-striped-line"><span class="crayon-v">echo</span><span class="crayon-sy">.</span><span class="crayon-o">&gt;</span><span class="crayon-v">hello</span><span class="crayon-o">-</span><span class="crayon-v">spark</span><span class="crayon-sy">.</span><span class="crayon-e">yml</span></div> <div id="urvanov-syntax-highlighter-610ff0b73370b441529691-3" class="crayon-line"><span class="crayon-e">notepad </span><span class="crayon-v">hello</span><span class="crayon-o">-</span><span class="crayon-v">spark</span><span class="crayon-sy">.</span><span class="crayon-v">yml</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">You can now edit the .yml file. Be careful with the indentation. 
Every dependency entry must be indented by the same number of spaces before its leading -.</p> <div id="urvanov-syntax-highlighter-610ff0b73370f437076032" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73370f437076032-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73370f437076032-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73370f437076032-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73370f437076032-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73370f437076032-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73370f437076032-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73370f437076032-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73370f437076032-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73370f437076032-9">9</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73370f437076032-10">10</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73370f437076032-11">11</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73370f437076032-12">12</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div 
id="urvanov-syntax-highlighter-610ff0b73370f437076032-1" class="crayon-line"><span class="crayon-v">name</span><span class="crayon-o">:</span> <span class="crayon-v">hello</span><span class="crayon-o">-</span><span class="crayon-e">spark </span></div> <div id="urvanov-syntax-highlighter-610ff0b73370f437076032-2" class="crayon-line crayon-striped-line"><span class="crayon-v">dependencies</span><span class="crayon-o">:</span></div> <div id="urvanov-syntax-highlighter-610ff0b73370f437076032-3" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span></div> <div id="urvanov-syntax-highlighter-610ff0b73370f437076032-4" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">-</span> <span class="crayon-v">python</span><span class="crayon-o">=</span><span class="crayon-cn">3.6</span></div> <div id="urvanov-syntax-highlighter-610ff0b73370f437076032-5" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">-</span> <span class="crayon-v">jupyter</span></div> <div id="urvanov-syntax-highlighter-610ff0b73370f437076032-6" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">-</span> <span class="crayon-v">ipython</span></div> <div id="urvanov-syntax-highlighter-610ff0b73370f437076032-7" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">-</span> <span class="crayon-v">numpy</span></div> <div id="urvanov-syntax-highlighter-610ff0b73370f437076032-8" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">-</span> <span class="crayon-v">numpy</span><span class="crayon-o">-</span><span class="crayon-v">base</span></div> <div id="urvanov-syntax-highlighter-610ff0b73370f437076032-9" class="crayon-line"><span 
class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">-</span> <span class="crayon-v">pandas</span></div> <div id="urvanov-syntax-highlighter-610ff0b73370f437076032-10" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">-</span> <span class="crayon-v">py4j</span></div> <div id="urvanov-syntax-highlighter-610ff0b73370f437076032-11" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">-</span> <span class="crayon-v">pyspark</span></div> <div id="urvanov-syntax-highlighter-610ff0b73370f437076032-12" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">-</span> <span class="crayon-v">pytz</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">Save the file and create the environment. This takes some time.</p> <div id="urvanov-syntax-highlighter-610ff0b733712963238175" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733712963238175-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733712963238175-1" class="crayon-line"><span class="crayon-e">conda </span><span class="crayon-e">env </span><span class="crayon-v">create</span> <span class="crayon-o">-</span><span class="crayon-i">f</span> <span 
class="crayon-v">hello</span><span class="crayon-o">-</span><span class="crayon-v">spark</span><span class="crayon-sy">.</span><span class="crayon-v">yml</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">You can list all the environments installed on your machine:</p> <div id="urvanov-syntax-highlighter-610ff0b733715559473751" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733715559473751-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733715559473751-1" class="crayon-line"><span class="crayon-e">conda </span><span class="crayon-e">env </span><span class="crayon-v">list</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b733718292652523" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" 
data-line="urvanov-syntax-highlighter-610ff0b733718292652523-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733718292652523-1" class="crayon-line"><span class="crayon-e">Activate </span><span class="crayon-v">hello</span><span class="crayon-o">-</span><span class="crayon-v">spark</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;"><strong>Mac users</strong></p> <div id="urvanov-syntax-highlighter-610ff0b73371a166592523" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73371a166592523-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73371a166592523-1" class="crayon-line"><span class="crayon-e">source </span><span class="crayon-e">activate </span><span class="crayon-v">hello</span><span class="crayon-o">-</span><span class="crayon-i">spark</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;"><strong>Windows users</strong></p> <div id="urvanov-syntax-highlighter-610ff0b73371d019993230" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div 
class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73371d019993230-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73371d019993230-1" class="crayon-line"><span class="crayon-e">activate </span><span class="crayon-v">hello</span><span class="crayon-o">-</span><span class="crayon-v">spark</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">Note: You have already created a dedicated TensorFlow environment to run the TensorFlow tutorials. It is more convenient to create a new environment separate from hello-tf.</p> <p style="text-align: justify;">Imagine that most of your projects involve TensorFlow, but you need Spark for one particular project. You can keep a TensorFlow environment for all your projects and create a separate environment for Spark. You can add as many libraries to the Spark environment as you like without interfering with the TensorFlow environment. Once you are done with the Spark project, you can delete its environment without affecting the TensorFlow environment.</p> <p style="text-align: justify;"><strong>Jupyter</strong></p> <p style="text-align: justify;">Open Jupyter Notebook and check whether PySpark works. 
In a new notebook, paste the following PySpark sample code:</p> <div id="urvanov-syntax-highlighter-610ff0b733721951400327" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733721951400327-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733721951400327-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733721951400327-3">3</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733721951400327-1" class="crayon-line"><span class="crayon-r">import</span> <span class="crayon-e">pyspark</span></div> <div id="urvanov-syntax-highlighter-610ff0b733721951400327-2" class="crayon-line crayon-striped-line"><span class="crayon-st">from</span> <span class="crayon-e">pyspark </span><span class="crayon-r">import</span> <span class="crayon-e">SparkContext</span></div> <div id="urvanov-syntax-highlighter-610ff0b733721951400327-3" class="crayon-line"><span class="crayon-v">sc</span> <span class="crayon-o">=</span> <span class="crayon-e">SparkContext</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">If an error is shown, it is likely that Java is not installed on your machine. 
On macOS, open a terminal and run java -version; if Java is installed, make sure the version is 1.8. On Windows, go to Application and check whether there is a Java folder. If there is a Java folder, check that Java 1.8 is installed.</p> <p style="text-align: justify;">If you need to install Java, visit this&nbsp;<a href="http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html">link</a>&nbsp;and download jdk-8u181-windows-x64.exe.</p> <p style="text-align: justify;"><img class="aligncenter size-full wp-image-8511 disappear appear" src="../../wp-content/uploads/2021/02/082918_1213_ApacheSpark2.png" sizes="(max-width: 649px) 100vw, 649px" srcset="https://old.tek4.vn/wp-content/uploads/2021/02/082918_1213_ApacheSpark2.png 649w, https://old.tek4.vn/wp-content/uploads/2021/02/082918_1213_ApacheSpark2-300x166.png 300w" alt="" width="649" height="360" loading="lazy" /></p> <p style="text-align: justify;">For Mac users, it is recommended to use brew.</p> <div id="urvanov-syntax-highlighter-610ff0b733725029805370" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733725029805370-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733725029805370-2">2</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> 
<div id="urvanov-syntax-highlighter-610ff0b733725029805370-1" class="crayon-line"><span class="crayon-e">brew </span><span class="crayon-e">tap </span><span class="crayon-v">caskroom</span><span class="crayon-o">/</span><span class="crayon-e">versions</span></div> <div id="urvanov-syntax-highlighter-610ff0b733725029805370-2" class="crayon-line crayon-striped-line"><span class="crayon-e">brew </span><span class="crayon-e">cask </span><span class="crayon-e">install </span><span class="crayon-v">java8</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h3 id="ftoc-heading-9" class="ftwp-heading" style="text-align: justify;">Spark Context</h3> <p style="text-align: justify;">SparkContext is the internal engine that allows connections to clusters. If you want to run an operation, you need a SparkContext.</p> <p style="text-align: justify;"><strong>Create a SparkContext</strong></p> <p style="text-align: justify;">First of all, you need to initialise a SparkContext.</p> <div id="urvanov-syntax-highlighter-610ff0b733729748105245" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733729748105245-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733729748105245-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733729748105245-3">3</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div 
id="urvanov-syntax-highlighter-610ff0b733729748105245-1" class="crayon-line"><span class="crayon-r">import</span> <span class="crayon-e">pyspark</span></div> <div id="urvanov-syntax-highlighter-610ff0b733729748105245-2" class="crayon-line crayon-striped-line"><span class="crayon-st">from</span> <span class="crayon-e">pyspark </span><span class="crayon-r">import</span> <span class="crayon-e">SparkContext</span></div> <div id="urvanov-syntax-highlighter-610ff0b733729748105245-3" class="crayon-line"><span class="crayon-v">sc</span> <span class="crayon-o">=</span> <span class="crayon-e">SparkContext</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">Now that the SparkContext is ready, you can create a collection of data called an RDD (Resilient Distributed Dataset). Computation on an RDD is automatically parallelized across the cluster.</p> <div id="urvanov-syntax-highlighter-610ff0b73372c903926635" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73372c903926635-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73372c903926635-1" class="crayon-line"><span class="crayon-v">nums</span> <span class="crayon-o">=</span> <span 
class="crayon-v">sc</span><span class="crayon-sy">.</span><span class="crayon-e">parallelize</span><span class="crayon-sy">(</span><span class="crayon-sy">[</span><span class="crayon-cn">1</span><span class="crayon-sy">,</span><span class="crayon-cn">2</span><span class="crayon-sy">,</span><span class="crayon-cn">3</span><span class="crayon-sy">,</span><span class="crayon-cn">4</span><span class="crayon-sy">]</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">You can access the first element with take(1):</p> <div id="urvanov-syntax-highlighter-610ff0b73372f240768995" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73372f240768995-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73372f240768995-1" class="crayon-line"><span class="crayon-v">nums</span><span class="crayon-sy">.</span><span class="crayon-e">take</span><span class="crayon-sy">(</span><span class="crayon-cn">1</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b733732558871919" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div 
class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733732558871919-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733732558871919-1" class="crayon-line"><span class="crayon-sy">[</span><span class="crayon-cn">1</span><span class="crayon-sy">]</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">You can apply a transformation to the data with a lambda function. In the PySpark example below, you return the square of each element of nums. This is a map transformation.</p> <div id="urvanov-syntax-highlighter-610ff0b733735182400622" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733735182400622-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733735182400622-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733735182400622-3">3</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733735182400622-1" 
class="crayon-line"><span class="crayon-v">squared</span> <span class="crayon-o">=</span> <span class="crayon-v">nums</span><span class="crayon-sy">.</span><span class="crayon-k ">map</span><span class="crayon-sy">(</span><span class="crayon-r">lambda</span> <span class="crayon-v">x</span><span class="crayon-o">:</span> <span class="crayon-v">x</span><span class="crayon-o">*</span><span class="crayon-v">x</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">collect</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733735182400622-2" class="crayon-line crayon-striped-line"><span class="crayon-st">for</span> <span class="crayon-e">num </span><span class="crayon-st">in</span> <span class="crayon-v">squared</span><span class="crayon-o">:</span></div> <div id="urvanov-syntax-highlighter-610ff0b733735182400622-3" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-k ">print</span><span class="crayon-sy">(</span><span class="crayon-s">'%i '</span> <span class="crayon-o">%</span> <span class="crayon-sy">(</span><span class="crayon-v">num</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b733738242548649" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" 
data-line="urvanov-syntax-highlighter-610ff0b733738242548649-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733738242548649-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733738242548649-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733738242548649-4">4</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733738242548649-1" class="crayon-line"><span class="crayon-cn">1</span></div> <div id="urvanov-syntax-highlighter-610ff0b733738242548649-2" class="crayon-line crayon-striped-line"><span class="crayon-cn">4</span></div> <div id="urvanov-syntax-highlighter-610ff0b733738242548649-3" class="crayon-line"><span class="crayon-cn">9</span></div> <div id="urvanov-syntax-highlighter-610ff0b733738242548649-4" class="crayon-line crayon-striped-line"><span class="crayon-cn">16</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h3 id="ftoc-heading-10" class="ftwp-heading" style="text-align: justify;">SQLContext</h3> <p style="text-align: justify;">A more convenient way is to use a DataFrame. With the SparkContext already set up, you can use it to create the DataFrame. You also need to declare a SQLContext.</p> <p style="text-align: justify;">SQLContext connects the engine to different data sources. 
It is used to initialise the functionality of Spark SQL.</p> <div id="urvanov-syntax-highlighter-610ff0b73373b610855506" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73373b610855506-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73373b610855506-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73373b610855506-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73373b610855506-4">4</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73373b610855506-1" class="crayon-line"><span class="crayon-st">from</span> <span class="crayon-v">pyspark</span><span class="crayon-sy">.</span><span class="crayon-e">sql </span><span class="crayon-r">import</span> <span class="crayon-e">Row</span></div> <div id="urvanov-syntax-highlighter-610ff0b73373b610855506-2" class="crayon-line crayon-striped-line"><span class="crayon-st">from</span> <span class="crayon-v">pyspark</span><span class="crayon-sy">.</span><span class="crayon-e">sql </span><span class="crayon-r">import</span> <span class="crayon-e">SQLContext</span></div> <div id="urvanov-syntax-highlighter-610ff0b73373b610855506-3" class="crayon-line"></div> <div id="urvanov-syntax-highlighter-610ff0b73373b610855506-4" class="crayon-line crayon-striped-line"><span 
class="crayon-v">sqlContext</span> <span class="crayon-o">=</span> <span class="crayon-e">SQLContext</span><span class="crayon-sy">(</span><span class="crayon-v">sc</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">Now, create a list of tuples. Each tuple contains a person's name and age. Four steps are required:</p> <p style="text-align: justify;">Step 1) Create the list of tuples with the information.</p> <div id="urvanov-syntax-highlighter-610ff0b73373e249315504" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73373e249315504-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73373e249315504-1" class="crayon-line"><span class="crayon-v">list_p</span> <span class="crayon-o">=</span> <span class="crayon-sy">[</span><span class="crayon-sy">(</span><span class="crayon-s">'John'</span><span class="crayon-sy">,</span><span class="crayon-cn">19</span><span class="crayon-sy">)</span><span class="crayon-sy">,</span><span class="crayon-sy">(</span><span class="crayon-s">'Smith'</span><span class="crayon-sy">,</span><span class="crayon-cn">29</span><span class="crayon-sy">)</span><span class="crayon-sy">,</span><span class="crayon-sy">(</span><span class="crayon-s">'Adam'</span><span class="crayon-sy">,</span><span class="crayon-cn">35</span><span 
class="crayon-sy">)</span><span class="crayon-sy">,</span><span class="crayon-sy">(</span><span class="crayon-s">'Henry'</span><span class="crayon-sy">,</span><span class="crayon-cn">50</span><span class="crayon-sy">)</span><span class="crayon-sy">]</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">Step 2) Build an RDD</p> <div id="urvanov-syntax-highlighter-610ff0b733741968495094" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733741968495094-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733741968495094-1" class="crayon-line"><span class="crayon-v">rdd</span> <span class="crayon-o">=</span> <span class="crayon-v">sc</span><span class="crayon-sy">.</span><span class="crayon-e">parallelize</span><span class="crayon-sy">(</span><span class="crayon-v">list_p</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">Step 3) Convert the tuples</p> <div id="urvanov-syntax-highlighter-610ff0b733744088807930" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div 
class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733744088807930-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733744088807930-1" class="crayon-line"><span class="crayon-v">ppl</span> <span class="crayon-o">=</span> <span class="crayon-v">rdd</span><span class="crayon-sy">.</span><span class="crayon-k ">map</span><span class="crayon-sy">(</span><span class="crayon-r">lambda</span> <span class="crayon-v">x</span><span class="crayon-o">:</span> <span class="crayon-e">Row</span><span class="crayon-sy">(</span><span class="crayon-v">name</span><span class="crayon-o">=</span><span class="crayon-v">x</span><span class="crayon-sy">[</span><span class="crayon-cn">0</span><span class="crayon-sy">]</span><span class="crayon-sy">,</span> <span class="crayon-v">age</span><span class="crayon-o">=</span><span class="crayon-k ">int</span><span class="crayon-sy">(</span><span class="crayon-v">x</span><span class="crayon-sy">[</span><span class="crayon-cn">1</span><span class="crayon-sy">]</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">Step 4) Create a DataFrame context</p> <div id="urvanov-syntax-highlighter-610ff0b733747582308353" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " 
data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733747582308353-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733747582308353-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733747582308353-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733747582308353-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733747582308353-5">5</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733747582308353-1" class="crayon-line"><span class="crayon-c"># Complete example combining steps 1 to 4; the last line creates the DataFrame</span></div> <div id="urvanov-syntax-highlighter-610ff0b733747582308353-2" class="crayon-line crayon-striped-line"><span class="crayon-v">list_p</span> <span class="crayon-o">=</span> <span class="crayon-sy">[</span><span class="crayon-sy">(</span><span class="crayon-s">'John'</span><span class="crayon-sy">,</span><span class="crayon-cn">19</span><span class="crayon-sy">)</span><span class="crayon-sy">,</span><span class="crayon-sy">(</span><span class="crayon-s">'Smith'</span><span class="crayon-sy">,</span><span class="crayon-cn">29</span><span class="crayon-sy">)</span><span class="crayon-sy">,</span><span class="crayon-sy">(</span><span class="crayon-s">'Adam'</span><span class="crayon-sy">,</span><span class="crayon-cn">35</span><span class="crayon-sy">)</span><span class="crayon-sy">,</span><span class="crayon-sy">(</span><span class="crayon-s">'Henry'</span><span class="crayon-sy">,</span><span class="crayon-cn">50</span><span class="crayon-sy">)</span><span class="crayon-sy">]</span></div> <div 
id="urvanov-syntax-highlighter-610ff0b733747582308353-3" class="crayon-line"><span class="crayon-v">rdd</span> <span class="crayon-o">=</span> <span class="crayon-v">sc</span><span class="crayon-sy">.</span><span class="crayon-e">parallelize</span><span class="crayon-sy">(</span><span class="crayon-v">list_p</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733747582308353-4" class="crayon-line crayon-striped-line"><span class="crayon-v">ppl</span> <span class="crayon-o">=</span> <span class="crayon-v">rdd</span><span class="crayon-sy">.</span><span class="crayon-k ">map</span><span class="crayon-sy">(</span><span class="crayon-r">lambda</span> <span class="crayon-v">x</span><span class="crayon-o">:</span> <span class="crayon-e">Row</span><span class="crayon-sy">(</span><span class="crayon-v">name</span><span class="crayon-o">=</span><span class="crayon-v">x</span><span class="crayon-sy">[</span><span class="crayon-cn">0</span><span class="crayon-sy">]</span><span class="crayon-sy">,</span> <span class="crayon-v">age</span><span class="crayon-o">=</span><span class="crayon-k ">int</span><span class="crayon-sy">(</span><span class="crayon-v">x</span><span class="crayon-sy">[</span><span class="crayon-cn">1</span><span class="crayon-sy">]</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733747582308353-5" class="crayon-line"><span class="crayon-v">DF_ppl</span> <span class="crayon-o">=</span> <span class="crayon-v">sqlContext</span><span class="crayon-sy">.</span><span class="crayon-e">createDataFrame</span><span class="crayon-sy">(</span><span class="crayon-v">ppl</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">If you want to see the type of each feature, you can use printSchema().</p> <div 
id="urvanov-syntax-highlighter-610ff0b73374a461248155" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73374a461248155-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73374a461248155-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73374a461248155-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73374a461248155-4">4</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73374a461248155-1" class="crayon-line"><span class="crayon-v">DF_ppl</span><span class="crayon-sy">.</span><span class="crayon-e">printSchema</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b73374a461248155-2" class="crayon-line crayon-striped-line"><span class="crayon-v">root</span></div> <div id="urvanov-syntax-highlighter-610ff0b73374a461248155-3" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">age</span><span class="crayon-o">:</span> <span class="crayon-t">long</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b73374a461248155-4" class="crayon-line 
crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">name</span><span class="crayon-o">:</span> <span class="crayon-t">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h3 id="ftoc-heading-11" class="ftwp-heading" style="text-align: justify;">Machine learning example with PySpark</h3> <p style="text-align: justify;">Now that you have a brief idea of Spark and SQLContext, you are ready to build your first machine learning program.</p> <p style="text-align: justify;">The steps to build a machine learning program with PySpark are:</p> <ul style="text-align: justify;"> <li>Step 1) Basic operations with PySpark</li> <li>Step 2) Data preprocessing</li> <li>Step 3) Build a data processing pipeline</li> <li>Step 4) Build the classifier: logistic</li> <li>Step 5) Train and evaluate the model</li> <li>Step 6) Tune the hyperparameters</li> </ul> <p style="text-align: justify;">We will use the adult dataset. 
The purpose of this tutorial is to learn how to use PySpark.</p> <h4 id="ftoc-heading-12" class="ftwp-heading" style="text-align: justify;">Step 1) Basic operations with PySpark</h4> <p style="text-align: justify;">First of all, you need to initialise the SQLContext.</p> <div id="urvanov-syntax-highlighter-610ff0b73374e143695730" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73374e143695730-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73374e143695730-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73374e143695730-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73374e143695730-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73374e143695730-5">5</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73374e143695730-1" class="crayon-line"><span class="crayon-st">from</span> <span class="crayon-v">pyspark</span><span class="crayon-sy">.</span><span class="crayon-e">sql </span><span class="crayon-r">import</span> <span class="crayon-e">SQLContext</span></div> <div id="urvanov-syntax-highlighter-610ff0b73374e143695730-2" class="crayon-line crayon-striped-line"><span class="crayon-v">url</span> <span class="crayon-o">=</span> <span class="crayon-s">"path or URL to your CSV data"</span></div> <div id="urvanov-syntax-highlighter-610ff0b73374e143695730-3" class="crayon-line"><span class="crayon-st">from</span> <span class="crayon-e">pyspark 
</span><span class="crayon-r">import</span> <span class="crayon-e">SparkFiles</span></div> <div id="urvanov-syntax-highlighter-610ff0b73374e143695730-4" class="crayon-line crayon-striped-line"><span class="crayon-v">sc</span><span class="crayon-sy">.</span><span class="crayon-e">addFile</span><span class="crayon-sy">(</span><span class="crayon-v">url</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b73374e143695730-5" class="crayon-line"><span class="crayon-v">sqlContext</span> <span class="crayon-o">=</span> <span class="crayon-e">SQLContext</span><span class="crayon-sy">(</span><span class="crayon-v">sc</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">Then you can read the CSV file with sqlContext.read.csv. Set inferSchema to True to tell Spark to infer the data types automatically. By default, it is False.</p> <div id="urvanov-syntax-highlighter-610ff0b733752871908851" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733752871908851-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733752871908851-1" class="crayon-line"><span class="crayon-v">df</span> <span class="crayon-o">=</span> <span class="crayon-v">sqlContext</span><span 
class="crayon-sy">.</span><span class="crayon-v">read</span><span class="crayon-sy">.</span><span class="crayon-k ">csv</span><span class="crayon-sy">(</span><span class="crayon-v">SparkFiles</span><span class="crayon-sy">.</span><span class="crayon-e">get</span><span class="crayon-sy">(</span><span class="crayon-s">"adult_data.csv"</span><span class="crayon-sy">)</span><span class="crayon-sy">,</span> <span class="crayon-v">header</span><span class="crayon-o">=</span><span class="crayon-t">True</span><span class="crayon-sy">,</span> <span class="crayon-v">inferSchema</span><span class="crayon-o">=</span><span class="crayon-t">True</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">Let's look at the data types:</p> <div id="urvanov-syntax-highlighter-610ff0b733755005072390" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733755005072390-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733755005072390-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733755005072390-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733755005072390-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733755005072390-5">5</div> <div class="crayon-num crayon-striped-num" 
data-line="urvanov-syntax-highlighter-610ff0b733755005072390-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733755005072390-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733755005072390-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733755005072390-9">9</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733755005072390-10">10</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733755005072390-11">11</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733755005072390-12">12</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733755005072390-13">13</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733755005072390-14">14</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733755005072390-15">15</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733755005072390-16">16</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733755005072390-17">17</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733755005072390-1" class="crayon-line"><span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-e">printSchema</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733755005072390-2" class="crayon-line crayon-striped-line"><span class="crayon-v">root</span></div> <div id="urvanov-syntax-highlighter-610ff0b733755005072390-3" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">age</span><span class="crayon-o">:</span> <span class="crayon-e">integer</span> <span 
class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733755005072390-4" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">workclass</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733755005072390-5" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">fnlwgt</span><span class="crayon-o">:</span> <span class="crayon-e">integer</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733755005072390-6" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">education</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733755005072390-7" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">education_num</span><span class="crayon-o">:</span> <span class="crayon-e">integer</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733755005072390-8" 
class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">marital</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733755005072390-9" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">occupation</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733755005072390-10" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">relationship</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733755005072390-11" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">race</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733755005072390-12" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">sex</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span 
class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733755005072390-13" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">capital_gain</span><span class="crayon-o">:</span> <span class="crayon-e">integer</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733755005072390-14" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">capital_loss</span><span class="crayon-o">:</span> <span class="crayon-e">integer</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733755005072390-15" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">hours_week</span><span class="crayon-o">:</span> <span class="crayon-e">integer</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733755005072390-16" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">native_country</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div 
id="urvanov-syntax-highlighter-610ff0b733755005072390-17" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">label</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">You can preview the data with show.</p> <div id="urvanov-syntax-highlighter-610ff0b733758698186079" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733758698186079-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733758698186079-1" class="crayon-line"><span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-e">show</span><span class="crayon-sy">(</span><span class="crayon-cn">5</span><span class="crayon-sy">,</span> <span class="crayon-v">truncate</span> <span class="crayon-o">=</span> <span class="crayon-t">False</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b73375b047000095" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes 
notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73375b047000095-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73375b047000095-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73375b047000095-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73375b047000095-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73375b047000095-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73375b047000095-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73375b047000095-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73375b047000095-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73375b047000095-9">9</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73375b047000095-10">10</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73375b047000095-1" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span 
class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span 
class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b73375b047000095-2" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-v">age</span><span class="crayon-o">|</span><span class="crayon-v">workclass</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-v">fnlwgt</span><span class="crayon-o">|</span><span class="crayon-v">education</span><span class="crayon-o">|</span><span class="crayon-v">education_num</span><span class="crayon-o">|</span><span class="crayon-v">marital</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-v">occupation</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-v">relationship</span> <span class="crayon-o">|</span><span class="crayon-v">race</span> <span class="crayon-o">|</span><span class="crayon-v">sex</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-v">capital_gain</span><span class="crayon-o">|</span><span 
class="crayon-v">capital_loss</span><span class="crayon-o">|</span><span class="crayon-v">hours_week</span><span class="crayon-o">|</span><span class="crayon-v">native_country</span><span class="crayon-o">|</span><span class="crayon-v">label</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73375b047000095-3" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span 
class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b73375b047000095-4" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-cn">39</span> <span class="crayon-o">|</span><span class="crayon-v">State</span><span class="crayon-o">-</span><span class="crayon-v">gov</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-cn">77516</span> <span class="crayon-o">|</span><span 
class="crayon-v">Bachelors</span><span class="crayon-o">|</span><span class="crayon-cn">13</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-v">Never</span><span class="crayon-o">-</span><span class="crayon-v">married</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-v">Adm</span><span class="crayon-o">-</span><span class="crayon-v">clerical</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-st">Not</span><span class="crayon-o">-</span><span class="crayon-st">in</span><span class="crayon-o">-</span><span class="crayon-v">family</span><span class="crayon-o">|</span><span class="crayon-v">White</span><span class="crayon-o">|</span><span class="crayon-v">Male</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-o">|</span><span class="crayon-cn">2174</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">|</span><span class="crayon-cn">0</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-cn">40</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">|</span><span class="crayon-v">United</span><span class="crayon-o">-</span><span class="crayon-v">States</span> <span class="crayon-o">|</span><span class="crayon-o">&lt;=</span><span class="crayon-cn">50K</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73375b047000095-5" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-cn">50</span> <span class="crayon-o">|</span><span class="crayon-r">Self</span><span class="crayon-o">-</span><span class="crayon-v">emp</span><span class="crayon-o">-</span><span class="crayon-st">not</span><span 
class="crayon-o">-</span><span class="crayon-v">inc</span><span class="crayon-o">|</span><span class="crayon-cn">83311</span> <span class="crayon-o">|</span><span class="crayon-v">Bachelors</span><span class="crayon-o">|</span><span class="crayon-cn">13</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-v">Married</span><span class="crayon-o">-</span><span class="crayon-v">civ</span><span class="crayon-o">-</span><span class="crayon-v">spouse</span><span class="crayon-o">|</span><span class="crayon-r">Exec</span><span class="crayon-o">-</span><span class="crayon-v">managerial</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-o">|</span><span class="crayon-v">Husband</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">|</span><span class="crayon-v">White</span><span class="crayon-o">|</span><span class="crayon-v">Male</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-o">|</span><span class="crayon-cn">0</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-cn">0</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-cn">13</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">|</span><span class="crayon-v">United</span><span class="crayon-o">-</span><span class="crayon-v">States</span> <span class="crayon-o">|</span><span class="crayon-o">&lt;=</span><span class="crayon-cn">50K</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73375b047000095-6" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-cn">38</span> <span class="crayon-o">|</span><span class="crayon-v">Private</span><span 
class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-cn">215646</span><span class="crayon-o">|</span><span class="crayon-v">HS</span><span class="crayon-o">-</span><span class="crayon-v">grad</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-o">|</span><span class="crayon-cn">9</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">|</span><span class="crayon-v">Divorced</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">|</span><span class="crayon-v">Handlers</span><span class="crayon-o">-</span><span class="crayon-v">cleaners</span><span class="crayon-o">|</span><span class="crayon-st">Not</span><span class="crayon-o">-</span><span class="crayon-st">in</span><span class="crayon-o">-</span><span class="crayon-v">family</span><span class="crayon-o">|</span><span class="crayon-v">White</span><span class="crayon-o">|</span><span class="crayon-v">Male</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-o">|</span><span class="crayon-cn">0</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-cn">0</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-cn">40</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">|</span><span class="crayon-v">United</span><span class="crayon-o">-</span><span class="crayon-v">States</span> <span class="crayon-o">|</span><span class="crayon-o">&lt;=</span><span class="crayon-cn">50K</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73375b047000095-7" class="crayon-line"><span class="crayon-o">|</span><span 
class="crayon-cn">53</span> <span class="crayon-o">|</span><span class="crayon-v">Private</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-cn">234721</span><span class="crayon-o">|</span><span class="crayon-cn">11th</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-cn">7</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">|</span><span class="crayon-v">Married</span><span class="crayon-o">-</span><span class="crayon-v">civ</span><span class="crayon-o">-</span><span class="crayon-v">spouse</span><span class="crayon-o">|</span><span class="crayon-v">Handlers</span><span class="crayon-o">-</span><span class="crayon-v">cleaners</span><span class="crayon-o">|</span><span class="crayon-v">Husband</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">|</span><span class="crayon-v">Black</span><span class="crayon-o">|</span><span class="crayon-v">Male</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-o">|</span><span class="crayon-cn">0</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-cn">0</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-cn">40</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">|</span><span class="crayon-v">United</span><span class="crayon-o">-</span><span class="crayon-v">States</span> <span class="crayon-o">|</span><span class="crayon-o">&lt;=</span><span class="crayon-cn">50K</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73375b047000095-8" class="crayon-line 
crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-cn">28</span> <span class="crayon-o">|</span><span class="crayon-v">Private</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-cn">338409</span><span class="crayon-o">|</span><span class="crayon-v">Bachelors</span><span class="crayon-o">|</span><span class="crayon-cn">13</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-v">Married</span><span class="crayon-o">-</span><span class="crayon-v">civ</span><span class="crayon-o">-</span><span class="crayon-v">spouse</span><span class="crayon-o">|</span><span class="crayon-v">Prof</span><span class="crayon-o">-</span><span class="crayon-v">specialty</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-v">Wife</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-v">Black</span><span class="crayon-o">|</span><span class="crayon-v">Female</span><span class="crayon-o">|</span><span class="crayon-cn">0</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-cn">0</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-o">|</span><span class="crayon-cn">40</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">|</span><span class="crayon-v">Cuba</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-o">|</span><span class="crayon-o">&lt;=</span><span class="crayon-cn">50K</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73375b047000095-9" 
class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span 
class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b73375b047000095-10" class="crayon-line crayon-striped-line"><span class="crayon-e">only </span><span class="crayon-e">showing </span><span class="crayon-i">top</span> <span class="crayon-cn">5</span> <span class="crayon-v">rows</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">If you do not set inferSchema to True, here is what happens to the types: every column is read as a string.</p>
<div id="urvanov-syntax-highlighter-610ff0b733760263714307" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733760263714307-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733760263714307-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733760263714307-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733760263714307-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733760263714307-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733760263714307-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733760263714307-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733760263714307-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733760263714307-9">9</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733760263714307-10">10</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733760263714307-11">11</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733760263714307-12">12</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733760263714307-13">13</div> <div 
class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733760263714307-14">14</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733760263714307-15">15</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733760263714307-16">16</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733760263714307-17">17</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733760263714307-18">18</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733760263714307-1" class="crayon-line"><span class="crayon-v">df_string</span> <span class="crayon-o">=</span> <span class="crayon-v">sqlContext</span><span class="crayon-sy">.</span><span class="crayon-v">read</span><span class="crayon-sy">.</span><span class="crayon-k ">csv</span><span class="crayon-sy">(</span><span class="crayon-v">SparkFiles</span><span class="crayon-sy">.</span><span class="crayon-e">get</span><span class="crayon-sy">(</span><span class="crayon-s">"adult.csv"</span><span class="crayon-sy">)</span><span class="crayon-sy">,</span> <span class="crayon-v">header</span><span class="crayon-o">=</span><span class="crayon-t">True</span><span class="crayon-sy">,</span> <span class="crayon-v">inferSchema</span><span class="crayon-o">=</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-t">False</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733760263714307-2" class="crayon-line crayon-striped-line"><span class="crayon-v">df_string</span><span class="crayon-sy">.</span><span class="crayon-e">printSchema</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733760263714307-3" class="crayon-line"><span class="crayon-v">root</span></div> <div 
id="urvanov-syntax-highlighter-610ff0b733760263714307-4" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">age</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733760263714307-5" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">workclass</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733760263714307-6" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">fnlwgt</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733760263714307-7" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">education</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733760263714307-8" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">education_num</span><span class="crayon-o">:</span> <span 
class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733760263714307-9" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">marital</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733760263714307-10" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">occupation</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733760263714307-11" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">relationship</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733760263714307-12" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">race</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div 
id="urvanov-syntax-highlighter-610ff0b733760263714307-13" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">sex</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733760263714307-14" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">capital_gain</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733760263714307-15" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">capital_loss</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733760263714307-16" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">hours_week</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733760263714307-17" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">native_country</span><span class="crayon-o">:</span> <span 
class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733760263714307-18" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">label</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">To convert the continuous variables to the correct type, you can recast the columns. Use withColumn to tell Spark which column the conversion applies to, and cast to set the new type.</p> <div id="urvanov-syntax-highlighter-610ff0b733764748228214" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-4">4</div> <div class="crayon-num" 
data-line="urvanov-syntax-highlighter-610ff0b733764748228214-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-9">9</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-10">10</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-11">11</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-12">12</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-13">13</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-14">14</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-15">15</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-16">16</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-17">17</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-18">18</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-19">19</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-20">20</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-21">21</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-22">22</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-23">23</div> 
<div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-24">24</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-25">25</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-26">26</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-27">27</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-28">28</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-29">29</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-30">30</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-31">31</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-32">32</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-33">33</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-34">34</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-35">35</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733764748228214-36">36</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-1" class="crayon-line"><span class="crayon-c"># Import all from `sql.types`</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-2" class="crayon-line crayon-striped-line"><span class="crayon-st">from</span> <span class="crayon-v">pyspark</span><span class="crayon-sy">.</span><span class="crayon-v">sql</span><span class="crayon-sy">.</span><span class="crayon-k ">types</span> <span class="crayon-r">import</span> <span 
class="crayon-o">*</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-3" class="crayon-line"></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-4" class="crayon-line crayon-striped-line"><span class="crayon-c"># Write a custom function to convert the data type of DataFrame columns</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-5" class="crayon-line"><span class="crayon-r">def</span> <span class="crayon-e">convertColumn</span><span class="crayon-sy">(</span><span class="crayon-v">df</span><span class="crayon-sy">,</span> <span class="crayon-v">names</span><span class="crayon-sy">,</span> <span class="crayon-v">newType</span><span class="crayon-sy">)</span><span class="crayon-o">:</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-6" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-st">for</span> <span class="crayon-e">name </span><span class="crayon-st">in</span> <span class="crayon-v">names</span><span class="crayon-o">:</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-7" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">df</span> <span class="crayon-o">=</span> <span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-e">withColumn</span><span class="crayon-sy">(</span><span class="crayon-v">name</span><span class="crayon-sy">,</span> <span class="crayon-v">df</span><span class="crayon-sy">[</span><span class="crayon-v">name</span><span class="crayon-sy">]</span><span class="crayon-sy">.</span><span class="crayon-e">cast</span><span class="crayon-sy">(</span><span class="crayon-v">newType</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-8" class="crayon-line crayon-striped-line"><span 
class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-st">return</span> <span class="crayon-i">df</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-9" class="crayon-line"><span class="crayon-c"># List of continuous features</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-10" class="crayon-line crayon-striped-line"><span class="crayon-v">CONTI_FEATURES</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-o">=</span> <span class="crayon-sy">[</span><span class="crayon-s">'age'</span><span class="crayon-sy">,</span> <span class="crayon-s">'fnlwgt'</span><span class="crayon-sy">,</span><span class="crayon-s">'capital_gain'</span><span class="crayon-sy">,</span> <span class="crayon-s">'education_num'</span><span class="crayon-sy">,</span> <span class="crayon-s">'capital_loss'</span><span class="crayon-sy">,</span> <span class="crayon-s">'hours_week'</span><span class="crayon-sy">]</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-11" class="crayon-line"><span class="crayon-c"># Convert the type</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-12" class="crayon-line crayon-striped-line"><span class="crayon-v">df_string</span> <span class="crayon-o">=</span> <span class="crayon-e">convertColumn</span><span class="crayon-sy">(</span><span class="crayon-v">df_string</span><span class="crayon-sy">,</span> <span class="crayon-v">CONTI_FEATURES</span><span class="crayon-sy">,</span> <span class="crayon-e">FloatType</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-13" class="crayon-line"><span class="crayon-c"># Check the dataset</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-14" class="crayon-line crayon-striped-line"><span class="crayon-v">df_string</span><span class="crayon-sy">.</span><span 
class="crayon-e">printSchema</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-15" class="crayon-line"><span class="crayon-v">root</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-16" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">age</span><span class="crayon-o">:</span> <span class="crayon-k ">float</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-17" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">workclass</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-18" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">fnlwgt</span><span class="crayon-o">:</span> <span class="crayon-k ">float</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-19" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">education</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div 
id="urvanov-syntax-highlighter-610ff0b733764748228214-20" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">education_num</span><span class="crayon-o">:</span> <span class="crayon-k ">float</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-21" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">marital</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-22" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">occupation</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-23" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">relationship</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-24" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">race</span><span 
class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-25" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">sex</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-26" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">capital_gain</span><span class="crayon-o">:</span> <span class="crayon-k ">float</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-27" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">capital_loss</span><span class="crayon-o">:</span> <span class="crayon-k ">float</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-28" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">hours_week</span><span class="crayon-o">:</span> <span class="crayon-k ">float</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span 
class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-29" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">native_country</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-30" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">label</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-31" class="crayon-line"></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-32" class="crayon-line crayon-striped-line"><span class="crayon-st">from</span> <span class="crayon-v">pyspark</span><span class="crayon-sy">.</span><span class="crayon-v">ml</span><span class="crayon-sy">.</span><span class="crayon-e">feature </span><span class="crayon-r">import</span> <span class="crayon-i">StringIndexer</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-33" class="crayon-line"><span class="crayon-c">#stringIndexer = StringIndexer(inputCol="label", outputCol="newlabel")</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-34" class="crayon-line crayon-striped-line"><span class="crayon-c">#model = stringIndexer.fit(df)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-35" class="crayon-line"><span class="crayon-c">#df = model.transform(df)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733764748228214-36" class="crayon-line 
crayon-striped-line"><span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-e">printSchema</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h5 id="ftoc-heading-13" class="ftwp-heading" style="text-align: justify;">Select columns</h5> <p style="text-align: justify;">You can pick columns with select by passing the feature names, then display the first rows with show. Below, age and fnlwgt are selected.</p> <div id="urvanov-syntax-highlighter-610ff0b733769061395117" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733769061395117-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733769061395117-1" class="crayon-line"><span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-k ">select</span><span class="crayon-sy">(</span><span class="crayon-s">'age'</span><span class="crayon-sy">,</span><span class="crayon-s">'fnlwgt'</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">show</span><span class="crayon-sy">(</span><span class="crayon-cn">5</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b73376c319684245" class="urvanov-syntax-highlighter-syntax 
crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73376c319684245-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73376c319684245-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73376c319684245-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73376c319684245-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73376c319684245-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73376c319684245-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73376c319684245-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73376c319684245-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73376c319684245-9">9</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73376c319684245-10">10</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73376c319684245-1" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div 
id="urvanov-syntax-highlighter-610ff0b73376c319684245-2" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-v">age</span><span class="crayon-o">|</span><span class="crayon-v">fnlwgt</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73376c319684245-3" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b73376c319684245-4" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span> <span class="crayon-cn">39</span><span class="crayon-o">|</span> <span class="crayon-cn">77516</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73376c319684245-5" class="crayon-line"><span class="crayon-o">|</span> <span class="crayon-cn">50</span><span class="crayon-o">|</span> <span class="crayon-cn">83311</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73376c319684245-6" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span> <span class="crayon-cn">38</span><span class="crayon-o">|</span><span class="crayon-cn">215646</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73376c319684245-7" class="crayon-line"><span class="crayon-o">|</span> <span class="crayon-cn">53</span><span class="crayon-o">|</span><span class="crayon-cn">234721</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73376c319684245-8" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span> <span class="crayon-cn">28</span><span class="crayon-o">|</span><span class="crayon-cn">338409</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73376c319684245-9" 
class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b73376c319684245-10" class="crayon-line crayon-striped-line"><span class="crayon-e">only </span><span class="crayon-e">showing </span><span class="crayon-i">top</span> <span class="crayon-cn">5</span> <span class="crayon-v">rows</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h5 id="ftoc-heading-14" class="ftwp-heading" style="text-align: justify;">Count by group</h5> <p style="text-align: justify;">To count the number of occurrences per group, you can chain:</p> <ul style="text-align: justify;"> <li>groupBy()</li> <li>count()</li> </ul> <p style="text-align: justify;">In the PySpark example below, you count the number of rows for each education level and sort the result by count in ascending order.</p> <div id="urvanov-syntax-highlighter-610ff0b73376f963903356" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73376f963903356-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73376f963903356-1" class="crayon-line"><span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-e">groupBy</span><span 
class="crayon-sy">(</span><span class="crayon-s">"education"</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">count</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">sort</span><span class="crayon-sy">(</span><span class="crayon-s">"count"</span><span class="crayon-sy">,</span><span class="crayon-v">ascending</span><span class="crayon-o">=</span><span class="crayon-t">True</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">show</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733772044533528-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733772044533528-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733772044533528-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733772044533528-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733772044533528-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733772044533528-6">6</div> <div class="crayon-num" 
data-line="urvanov-syntax-highlighter-610ff0b733772044533528-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733772044533528-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733772044533528-9">9</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733772044533528-10">10</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733772044533528-11">11</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733772044533528-12">12</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733772044533528-13">13</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733772044533528-14">14</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733772044533528-15">15</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733772044533528-16">16</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733772044533528-17">17</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733772044533528-18">18</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733772044533528-19">19</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733772044533528-20">20</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-1" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span 
class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-2" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-v">education</span><span class="crayon-o">|</span><span class="crayon-v">count</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-3" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-4" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-v">Preschool</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-cn">51</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-5" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">1st</span><span class="crayon-o">-</span><span class="crayon-cn">4th</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">168</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-6" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">5th</span><span class="crayon-o">-</span><span class="crayon-cn">6th</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">333</span><span 
class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-7" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-v">Doctorate</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">413</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-8" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">12th</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">433</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-9" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">9th</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">514</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-10" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span> <span class="crayon-v">Prof</span><span class="crayon-o">-</span><span class="crayon-v">school</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">576</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-11" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">7th</span><span class="crayon-o">-</span><span class="crayon-cn">8th</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">646</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-12" 
class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">10th</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">933</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-13" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-v">Assoc</span><span class="crayon-o">-</span><span class="crayon-v">acdm</span><span class="crayon-o">|</span> <span class="crayon-cn">1067</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-14" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">11th</span><span class="crayon-o">|</span> <span class="crayon-cn">1175</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-15" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-v">Assoc</span><span class="crayon-o">-</span><span class="crayon-v">voc</span><span class="crayon-o">|</span> <span class="crayon-cn">1382</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-16" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">Masters</span><span class="crayon-o">|</span> <span class="crayon-cn">1723</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-17" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-v">Bachelors</span><span class="crayon-o">|</span> <span 
class="crayon-cn">5355</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-18" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-v">Some</span><span class="crayon-o">-</span><span class="crayon-v">college</span><span class="crayon-o">|</span> <span class="crayon-cn">7291</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-19" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">HS</span><span class="crayon-o">-</span><span class="crayon-v">grad</span><span class="crayon-o">|</span><span class="crayon-cn">10501</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733772044533528-20" class="crayon-line crayon-striped-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h5 id="ftoc-heading-15" class="ftwp-heading" style="text-align: justify;">Describe the data</h5> <p style="text-align: justify;">To get summary statistics of the data, you can use describe(), which reports:</p> <ul style="text-align: justify;"> <li>count</li> <li>mean</li> <li>standard deviation (stddev)</li> <li>min</li> <li>max</li> </ul> <div id="urvanov-syntax-highlighter-610ff0b733775049147468" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div 
class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733775049147468-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733775049147468-1" class="crayon-line"><span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-e">describe</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">show</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b733778857995791" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733778857995791-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733778857995791-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733778857995791-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733778857995791-4">4</div> <div class="crayon-num" 
data-line="urvanov-syntax-highlighter-610ff0b733778857995791-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733778857995791-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733778857995791-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733778857995791-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733778857995791-9">9</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733778857995791-1" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span 
class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span 
class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b733778857995791-2" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-v">summary</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">age</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-v">workclass</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">fnlwgt</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-v">education</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">education_num</span><span class="crayon-o">|</span> <span class="crayon-v">marital</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">occupation</span><span class="crayon-o">|</span><span class="crayon-v">relationship</span><span class="crayon-o">|</span><span 
class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">race</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-v">sex</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">capital_gain</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">capital_loss</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">hours_week</span><span class="crayon-o">|</span><span class="crayon-v">native_country</span><span class="crayon-o">|</span><span class="crayon-v">label</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733778857995791-3" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span 
class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span 
class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b733778857995791-4" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-v">count</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">32561</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">32561</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">32561</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">32561</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">32561</span><span 
class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-cn">32561</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">32561</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">32561</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">32561</span><span class="crayon-o">|</span> <span class="crayon-cn">32561</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">32561</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">32561</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">32561</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">32561</span><span class="crayon-o">|</span><span class="crayon-cn">32561</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733778857995791-5" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-v">mean</span><span class="crayon-o">|</span> <span class="crayon-cn">38.58164675532078</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-t">null</span><span class="crayon-o">|</span><span class="crayon-cn">189778.36651208502</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span 
class="crayon-t">null</span><span class="crayon-o">|</span> <span class="crayon-cn">10.0806793403151</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-t">null</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-t">null</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-t">null</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-t">null</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-t">null</span><span class="crayon-o">|</span><span class="crayon-cn">1077.6488437087312</span><span class="crayon-o">|</span> <span class="crayon-cn">87.303829734959</span><span class="crayon-o">|</span><span class="crayon-cn">40.437455852092995</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-t">null</span><span class="crayon-o">|</span> <span class="crayon-t">null</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733778857995791-6" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span> <span class="crayon-v">stddev</span><span class="crayon-o">|</span><span class="crayon-cn">13.640432553581356</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-t">null</span><span class="crayon-o">|</span><span class="crayon-cn">105549.97769702227</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-t">null</span><span class="crayon-o">|</span><span 
class="crayon-cn">2.572720332067397</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-t">null</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-t">null</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-t">null</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-t">null</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-t">null</span><span class="crayon-o">|</span> <span class="crayon-cn">7385.292084840354</span><span class="crayon-o">|</span><span class="crayon-cn">402.960218649002</span><span class="crayon-o">|</span><span class="crayon-cn">12.347428681731838</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-t">null</span><span class="crayon-o">|</span> <span class="crayon-t">null</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733778857995791-7" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">min</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">17</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-sy">?</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">12285</span><span 
class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">10th</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">1</span><span class="crayon-o">|</span><span class="crayon-v">Divorced</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-sy">?</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">Husband</span><span class="crayon-o">|</span><span class="crayon-v">Amer</span><span class="crayon-o">-</span><span class="crayon-v">Indian</span><span class="crayon-o">-</span><span class="crayon-v">Eskimo</span><span class="crayon-o">|</span><span class="crayon-v">Female</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">1</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-sy">?</span><span class="crayon-o">|</span><span class="crayon-o">&lt;=</span><span class="crayon-cn">50K</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733778857995791-8" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span 
class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">max</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">90</span><span class="crayon-o">|</span><span class="crayon-v">Without</span><span class="crayon-o">-</span><span class="crayon-v">pay</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">1484705</span><span class="crayon-o">|</span><span class="crayon-v">Some</span><span class="crayon-o">-</span><span class="crayon-v">college</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">16</span><span class="crayon-o">|</span> <span class="crayon-v">Widowed</span><span class="crayon-o">|</span><span class="crayon-v">Transport</span><span class="crayon-o">-</span><span class="crayon-v">moving</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">Wife</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">White</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-v">Male</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">99999</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">4356</span><span class="crayon-o">|</span><span 
class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">99</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">Yugoslavia</span><span class="crayon-o">|</span> <span class="crayon-o">&gt;</span><span class="crayon-cn">50K</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733778857995791-9" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span 
class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span 
class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">To get summary statistics for a single column only, pass the column name to describe().</p> <div id="urvanov-syntax-highlighter-610ff0b73377d322545531" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73377d322545531-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73377d322545531-1" class="crayon-line"><span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-e">describe</span><span class="crayon-sy">(</span><span class="crayon-s">'capital_gain'</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">show</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> 
</div> </div> <div id="urvanov-syntax-highlighter-610ff0b733780627301515" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733780627301515-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733780627301515-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733780627301515-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733780627301515-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733780627301515-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733780627301515-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733780627301515-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733780627301515-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733780627301515-9">9</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733780627301515-1" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span 
class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b733780627301515-2" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-v">summary</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">capital_gain</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733780627301515-3" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b733780627301515-4" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-v">count</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">32561</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733780627301515-5" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-v">mean</span><span class="crayon-o">|</span><span class="crayon-cn">1077.6488437087312</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733780627301515-6" class="crayon-line 
crayon-striped-line"><span class="crayon-o">|</span> <span class="crayon-v">stddev</span><span class="crayon-o">|</span> <span class="crayon-cn">7385.292084840354</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733780627301515-7" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">min</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733780627301515-8" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">max</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">99999</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733780627301515-9" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h5 id="ftoc-heading-16" class="ftwp-heading" style="text-align: justify;">Crosstab computation</h5> <p style="text-align: justify;">In some cases it is useful to look at descriptive statistics for a pair of columns. 
For example, you can count the number of people with income below or above 50K by age. This operation is called a crosstab.</p> <div id="urvanov-syntax-highlighter-610ff0b733784505094103" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733784505094103-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733784505094103-1" class="crayon-line"><span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-e">crosstab</span><span class="crayon-sy">(</span><span class="crayon-s">'age'</span><span class="crayon-sy">,</span> <span class="crayon-s">'label'</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">sort</span><span class="crayon-sy">(</span><span class="crayon-s">"age_label"</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">show</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div 
class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-9">9</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-10">10</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-11">11</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-12">12</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-13">13</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-14">14</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-15">15</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-16">16</div> <div 
class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-17">17</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-18">18</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-19">19</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-20">20</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-21">21</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-22">22</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-23">23</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-24">24</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733787265806518-25">25</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-1" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-2" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-v">age_label</span><span class="crayon-o">|</span><span class="crayon-o">&lt;=</span><span class="crayon-cn">50K</span><span class="crayon-o">|</span><span class="crayon-o">&gt;</span><span class="crayon-cn">50K</span><span class="crayon-o">|</span></div> <div 
id="urvanov-syntax-highlighter-610ff0b733787265806518-3" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-4" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">17</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">395</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-cn">0</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-5" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">18</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">550</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-cn">0</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-6" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">19</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">710</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-cn">2</span><span class="crayon-o">|</span></div> <div 
id="urvanov-syntax-highlighter-610ff0b733787265806518-7" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">20</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">753</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-cn">0</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-8" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">21</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">717</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-cn">3</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-9" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">22</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">752</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">13</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-10" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">23</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">865</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">12</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-11" class="crayon-line"><span 
class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">24</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">767</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">31</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-12" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">25</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">788</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">53</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-13" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">26</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">722</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">63</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-14" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">27</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">754</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">81</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-15" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span 
class="crayon-cn">28</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">748</span><span class="crayon-o">|</span> <span class="crayon-cn">119</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-16" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">29</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">679</span><span class="crayon-o">|</span> <span class="crayon-cn">134</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-17" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">30</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">690</span><span class="crayon-o">|</span> <span class="crayon-cn">171</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-18" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">31</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">705</span><span class="crayon-o">|</span> <span class="crayon-cn">183</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-19" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">32</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">639</span><span class="crayon-o">|</span> <span class="crayon-cn">189</span><span class="crayon-o">|</span></div> <div 
id="urvanov-syntax-highlighter-610ff0b733787265806518-20" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">33</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">684</span><span class="crayon-o">|</span> <span class="crayon-cn">191</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-21" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">34</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">643</span><span class="crayon-o">|</span> <span class="crayon-cn">243</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-22" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">35</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">659</span><span class="crayon-o">|</span> <span class="crayon-cn">217</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-23" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">36</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">635</span><span class="crayon-o">|</span> <span class="crayon-cn">263</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-24" class="crayon-line crayon-striped-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span 
class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b733787265806518-25" class="crayon-line"><span class="crayon-e">only </span><span class="crayon-e">showing </span><span class="crayon-i">top</span> <span class="crayon-cn">20</span> <span class="crayon-v">rows</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">You can see that nobody has an income above 50K when they are young.</p> <h5 id="ftoc-heading-17" class="ftwp-heading" style="text-align: justify;">Drop column</h5> <p style="text-align: justify;">There are two intuitive APIs for dropping data:</p> <ul style="text-align: justify;"> <li>drop(): drops a column</li> <li>dropna(): drops rows that contain NA (null) values</li> </ul> <p style="text-align: justify;">Below, you drop the column education_num:</p> <div id="urvanov-syntax-highlighter-610ff0b73378b062188915" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73378b062188915-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73378b062188915-2">2</div> <div class="crayon-num" 
data-line="urvanov-syntax-highlighter-610ff0b73378b062188915-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73378b062188915-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73378b062188915-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73378b062188915-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73378b062188915-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73378b062188915-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73378b062188915-9">9</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73378b062188915-10">10</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73378b062188915-11">11</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73378b062188915-12">12</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73378b062188915-13">13</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73378b062188915-14">14</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73378b062188915-15">15</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73378b062188915-16">16</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73378b062188915-1" class="crayon-line"><span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-e">drop</span><span class="crayon-sy">(</span><span class="crayon-s">'education_num'</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-i">columns</span></div> <div id="urvanov-syntax-highlighter-610ff0b73378b062188915-2" class="crayon-line 
crayon-striped-line"></div> <div id="urvanov-syntax-highlighter-610ff0b73378b062188915-3" class="crayon-line"><span class="crayon-sy">[</span><span class="crayon-s">'age'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b73378b062188915-4" class="crayon-line crayon-striped-line"><span class="crayon-s">'workclass'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b73378b062188915-5" class="crayon-line"><span class="crayon-s">'fnlwgt'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b73378b062188915-6" class="crayon-line crayon-striped-line"><span class="crayon-s">'education'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b73378b062188915-7" class="crayon-line"><span class="crayon-s">'marital'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b73378b062188915-8" class="crayon-line crayon-striped-line"><span class="crayon-s">'occupation'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b73378b062188915-9" class="crayon-line"><span class="crayon-s">'relationship'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b73378b062188915-10" class="crayon-line crayon-striped-line"><span class="crayon-s">'race'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b73378b062188915-11" class="crayon-line"><span class="crayon-s">'sex'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b73378b062188915-12" class="crayon-line crayon-striped-line"><span class="crayon-s">'capital_gain'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b73378b062188915-13" class="crayon-line"><span class="crayon-s">'capital_loss'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b73378b062188915-14" 
class="crayon-line crayon-striped-line"><span class="crayon-s">'hours_week'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b73378b062188915-15" class="crayon-line"><span class="crayon-s">'native_country'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b73378b062188915-16" class="crayon-line crayon-striped-line"><span class="crayon-s">'label'</span><span class="crayon-sy">]</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h5 id="ftoc-heading-18" class="ftwp-heading" style="text-align: justify;">Filter data</h5> <p style="text-align: justify;">You can use filter() to apply descriptive statistics to a subset of the data. For example, you can count the number of people over 40 years old:</p> <div id="urvanov-syntax-highlighter-610ff0b73378e900714457" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73378e900714457-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73378e900714457-1" class="crayon-line"><span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-k ">filter</span><span class="crayon-sy">(</span><span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-v">age</span> <span class="crayon-o">&gt;</span> <span class="crayon-cn">40</span><span 
class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">count</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b733791528542452" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733791528542452-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733791528542452-1" class="crayon-line"><span class="crayon-cn">13443</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h5 id="ftoc-heading-19" class="ftwp-heading" style="text-align: justify;">Descriptive statistics by group</h5> <p style="text-align: justify;">Finally, you can group the data and compute statistics such as the mean for each group.</p> <div id="urvanov-syntax-highlighter-610ff0b733794943307430" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " 
data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733794943307430-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733794943307430-1" class="crayon-line"><span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-e">groupby</span><span class="crayon-sy">(</span><span class="crayon-s">'marital'</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">agg</span><span class="crayon-sy">(</span><span class="crayon-sy">{</span><span class="crayon-s">'capital_gain'</span><span class="crayon-o">:</span> <span class="crayon-s">'mean'</span><span class="crayon-sy">}</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">show</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b733797641794652" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733797641794652-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733797641794652-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733797641794652-3">3</div> <div class="crayon-num crayon-striped-num" 
data-line="urvanov-syntax-highlighter-610ff0b733797641794652-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733797641794652-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733797641794652-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733797641794652-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733797641794652-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733797641794652-9">9</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733797641794652-10">10</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733797641794652-11">11</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733797641794652-1" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b733797641794652-2" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">marital</span><span class="crayon-o">|</span> <span 
class="crayon-e">avg</span><span class="crayon-sy">(</span><span class="crayon-v">capital_gain</span><span class="crayon-sy">)</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733797641794652-3" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b733797641794652-4" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">Separated</span><span class="crayon-o">|</span> <span class="crayon-cn">535.5687804878049</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733797641794652-5" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">Never</span><span class="crayon-o">-</span><span class="crayon-v">married</span><span class="crayon-o">|</span><span class="crayon-cn">376.58831788823363</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733797641794652-6" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-v">Married</span><span class="crayon-o">-</span><span class="crayon-v">spouse</span><span 
class="crayon-o">-</span><span class="crayon-v">ab</span><span class="crayon-sy">.</span><span class="crayon-sy">.</span><span class="crayon-sy">.</span><span class="crayon-o">|</span> <span class="crayon-cn">653.9832535885167</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733797641794652-7" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">Divorced</span><span class="crayon-o">|</span> <span class="crayon-cn">728.4148098131893</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733797641794652-8" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">Widowed</span><span class="crayon-o">|</span> <span class="crayon-cn">571.0715005035247</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733797641794652-9" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-v">Married</span><span class="crayon-o">-</span><span class="crayon-v">AF</span><span class="crayon-o">-</span><span class="crayon-v">spouse</span><span class="crayon-o">|</span> <span class="crayon-cn">432.6521739130435</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733797641794652-10" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-v">Married</span><span class="crayon-o">-</span><span class="crayon-v">civ</span><span class="crayon-o">-</span><span class="crayon-v">spouse</span><span class="crayon-o">|</span><span class="crayon-cn">1764.8595085470085</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733797641794652-11" 
class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h4 id="ftoc-heading-20" class="ftwp-heading" style="text-align: justify;">Step 2) Data preprocessing</h4> <p style="text-align: justify;">Data processing is a critical step in machine learning. After you remove garbage data, you gain some important insights.</p> <p style="text-align: justify;">For example, you know that income is not a linear function of age. When people are young, their income is usually lower than in middle age. After retirement, a household draws on its savings, which means income decreases. 
To capture this pattern, you can add the square of age as a feature.</p> <p style="text-align: justify;"><strong>Add age square</strong></p> <p style="text-align: justify;">To add a new feature, you need to:</p> <ol style="text-align: justify;"> <li>Select the column</li> <li>Apply the transformation and add it to the DataFrame</li> </ol> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-9">9</div> <div class="crayon-num crayon-striped-num" 
data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-10">10</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-11">11</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-12">12</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-13">13</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-14">14</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-15">15</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-16">16</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-17">17</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-18">18</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-19">19</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-20">20</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-21">21</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-22">22</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-23">23</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-24">24</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-25">25</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337a8034364566-26">26</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-1" class="crayon-line"><span class="crayon-st">from</span> 
<span class="crayon-v">pyspark</span><span class="crayon-sy">.</span><span class="crayon-v">sql</span><span class="crayon-sy">.</span><span class="crayon-e">functions </span><span class="crayon-r">import</span> <span class="crayon-v">col</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-2" class="crayon-line crayon-striped-line"></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-3" class="crayon-line"><span class="crayon-c"># 1 Select the column</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-4" class="crayon-line crayon-striped-line"><span class="crayon-v">age_square</span> <span class="crayon-o">=</span> <span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-k ">select</span><span class="crayon-sy">(</span><span class="crayon-e">col</span><span class="crayon-sy">(</span><span class="crayon-s">"age"</span><span class="crayon-sy">)</span><span class="crayon-o">*</span><span class="crayon-o">*</span><span class="crayon-cn">2</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-5" class="crayon-line"></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-6" class="crayon-line crayon-striped-line"><span class="crayon-c"># 2 Apply the transformation and add it to the DataFrame</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-7" class="crayon-line"><span class="crayon-v">df</span> <span class="crayon-o">=</span> <span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-e">withColumn</span><span class="crayon-sy">(</span><span class="crayon-s">"age_square"</span><span class="crayon-sy">,</span> <span class="crayon-e">col</span><span class="crayon-sy">(</span><span class="crayon-s">"age"</span><span class="crayon-sy">)</span><span class="crayon-o">*</span><span class="crayon-o">*</span><span class="crayon-cn">2</span><span class="crayon-sy">)</span></div> <div 
id="urvanov-syntax-highlighter-610ff0b7337a8034364566-8" class="crayon-line crayon-striped-line"></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-9" class="crayon-line"><span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-e">printSchema</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-10" class="crayon-line crayon-striped-line"><span class="crayon-v">root</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-11" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">age</span><span class="crayon-o">:</span> <span class="crayon-e">integer</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-12" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">workclass</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-13" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">fnlwgt</span><span class="crayon-o">:</span> <span class="crayon-e">integer</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-14" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span 
class="crayon-v">education</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-15" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">education_num</span><span class="crayon-o">:</span> <span class="crayon-e">integer</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-16" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">marital</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-17" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">occupation</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-18" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">relationship</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span 
class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-19" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">race</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-20" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">sex</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-21" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">capital_gain</span><span class="crayon-o">:</span> <span class="crayon-e">integer</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-22" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">capital_loss</span><span class="crayon-o">:</span> <span class="crayon-e">integer</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-23" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span 
class="crayon-v">hours_week</span><span class="crayon-o">:</span> <span class="crayon-e">integer</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-24" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">native_country</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-25" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">label</span><span class="crayon-o">:</span> <span class="crayon-k ">string</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337a8034364566-26" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">age_square</span><span class="crayon-o">:</span> <span class="crayon-e">double</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">You can see that age_square has been successfully added to the DataFrame. You can change the order of the variables with select. 
Below, you place age_square right after age.</p> <div id="urvanov-syntax-highlighter-610ff0b7337c2876063338" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337c2876063338-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337c2876063338-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337c2876063338-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337c2876063338-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337c2876063338-5">5</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b7337c2876063338-1" class="crayon-line"><span class="crayon-v">COLUMNS</span> <span class="crayon-o">=</span> <span class="crayon-sy">[</span><span class="crayon-s">'age'</span><span class="crayon-sy">,</span> <span class="crayon-s">'age_square'</span><span class="crayon-sy">,</span> <span class="crayon-s">'workclass'</span><span class="crayon-sy">,</span> <span class="crayon-s">'fnlwgt'</span><span class="crayon-sy">,</span> <span class="crayon-s">'education'</span><span class="crayon-sy">,</span> <span class="crayon-s">'education_num'</span><span class="crayon-sy">,</span> <span class="crayon-s">'marital'</span><span class="crayon-sy">,</span></div> <div 
id="urvanov-syntax-highlighter-610ff0b7337c2876063338-2" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-s">'occupation'</span><span class="crayon-sy">,</span> <span class="crayon-s">'relationship'</span><span class="crayon-sy">,</span> <span class="crayon-s">'race'</span><span class="crayon-sy">,</span> <span class="crayon-s">'sex'</span><span class="crayon-sy">,</span> <span class="crayon-s">'capital_gain'</span><span class="crayon-sy">,</span> <span class="crayon-s">'capital_loss'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337c2876063338-3" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-s">'hours_week'</span><span class="crayon-sy">,</span> <span class="crayon-s">'native_country'</span><span class="crayon-sy">,</span> <span class="crayon-s">'label'</span><span class="crayon-sy">]</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337c2876063338-4" class="crayon-line crayon-striped-line"><span class="crayon-v">df</span> <span class="crayon-o">=</span> <span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-k ">select</span><span class="crayon-sy">(</span><span class="crayon-v">COLUMNS</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337c2876063338-5" class="crayon-line"><span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-e">first</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b7337c9200365330" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize 
scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337c9200365330-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b7337c9200365330-1" class="crayon-line"><span class="crayon-e">Row</span><span class="crayon-sy">(</span><span class="crayon-v">age</span><span class="crayon-o">=</span><span class="crayon-cn">39</span><span class="crayon-sy">,</span> <span class="crayon-v">age_square</span><span class="crayon-o">=</span><span class="crayon-cn">1521.0</span><span class="crayon-sy">,</span> <span class="crayon-v">workclass</span><span class="crayon-o">=</span><span class="crayon-s">'State-gov'</span><span class="crayon-sy">,</span> <span class="crayon-v">fnlwgt</span><span class="crayon-o">=</span><span class="crayon-cn">77516</span><span class="crayon-sy">,</span> <span class="crayon-v">education</span><span class="crayon-o">=</span><span class="crayon-s">'Bachelors'</span><span class="crayon-sy">,</span> <span class="crayon-v">education_num</span><span class="crayon-o">=</span><span class="crayon-cn">13</span><span class="crayon-sy">,</span> <span class="crayon-v">marital</span><span class="crayon-o">=</span><span class="crayon-s">'Never-married'</span><span class="crayon-sy">,</span> <span class="crayon-v">occupation</span><span class="crayon-o">=</span><span class="crayon-s">'Adm-clerical'</span><span class="crayon-sy">,</span> <span class="crayon-v">relationship</span><span class="crayon-o">=</span><span class="crayon-s">'Not-in-family'</span><span class="crayon-sy">,</span> <span class="crayon-v">race</span><span class="crayon-o">=</span><span 
class="crayon-s">'White'</span><span class="crayon-sy">,</span> <span class="crayon-v">sex</span><span class="crayon-o">=</span><span class="crayon-s">'Male'</span><span class="crayon-sy">,</span> <span class="crayon-v">capital_gain</span><span class="crayon-o">=</span><span class="crayon-cn">2174</span><span class="crayon-sy">,</span> <span class="crayon-v">capital_loss</span><span class="crayon-o">=</span><span class="crayon-cn">0</span><span class="crayon-sy">,</span> <span class="crayon-v">hours_week</span><span class="crayon-o">=</span><span class="crayon-cn">40</span><span class="crayon-sy">,</span> <span class="crayon-v">native_country</span><span class="crayon-o">=</span><span class="crayon-s">'United-States'</span><span class="crayon-sy">,</span> <span class="crayon-v">label</span><span class="crayon-o">=</span><span class="crayon-s">'&lt;=50K'</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;"><strong>Exclude Holand-Netherlands</strong></p> <p style="text-align: justify;">When a group within a feature has only one observation, it brings no information to the model. 
Worse, it can cause an error during cross-validation.</p> <p style="text-align: justify;">Let's check the households' country of origin.</p> <div id="urvanov-syntax-highlighter-610ff0b7337cd855236671" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337cd855236671-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337cd855236671-2">2</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b7337cd855236671-1" class="crayon-line"><span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-k ">filter</span><span class="crayon-sy">(</span><span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-v">native_country</span> <span class="crayon-o">==</span> <span class="crayon-s">'Holand-Netherlands'</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">count</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337cd855236671-2" class="crayon-line crayon-striped-line"><span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-e">groupby</span><span class="crayon-sy">(</span><span class="crayon-s">'native_country'</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span 
class="crayon-e">agg</span><span class="crayon-sy">(</span><span class="crayon-sy">{</span><span class="crayon-s">'native_country'</span><span class="crayon-o">:</span> <span class="crayon-s">'count'</span><span class="crayon-sy">}</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">sort</span><span class="crayon-sy">(</span><span class="crayon-e">asc</span><span class="crayon-sy">(</span><span class="crayon-s">"count(native_country)"</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">show</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-6">6</div> <div class="crayon-num" 
data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-9">9</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-10">10</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-11">11</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-12">12</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-13">13</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-14">14</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-15">15</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-16">16</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-17">17</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-18">18</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-19">19</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-20">20</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-21">21</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-22">22</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-23">23</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-24">24</div> <div class="crayon-num" 
data-line="urvanov-syntax-highlighter-610ff0b7337d0635477623-25">25</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-1" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-2" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">native_country</span><span class="crayon-o">|</span><span class="crayon-e">count</span><span class="crayon-sy">(</span><span class="crayon-v">native_country</span><span class="crayon-sy">)</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-3" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span 
class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-4" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-v">Holand</span><span class="crayon-o">-</span><span class="crayon-v">Netherlands</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">1</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-5" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">Scotland</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">12</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-6" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">Hungary</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">13</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-7" class="crayon-line"><span class="crayon-o">|</span><span 
class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">Honduras</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">13</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-8" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-v">Outlying</span><span class="crayon-o">-</span><span class="crayon-v">US</span><span class="crayon-sy">(</span><span class="crayon-v">Guam</span><span class="crayon-o">-</span><span class="crayon-sy">.</span><span class="crayon-sy">.</span><span class="crayon-sy">.</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">14</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-9" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">Yugoslavia</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">16</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-10" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">Thailand</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span 
class="crayon-cn">18</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-11" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">Laos</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">18</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-12" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">Cambodia</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">19</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-13" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">Trinadad</span><span class="crayon-o">&amp;</span><span class="crayon-v">Tobago</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">19</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-14" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">Hong</span><span class="crayon-o">|</span><span 
class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">20</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-15" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">Ireland</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">24</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-16" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">Ecuador</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">28</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-17" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">Greece</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">29</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-18" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span 
class="crayon-v">France</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">29</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-19" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">Peru</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">31</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-20" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">Nicaragua</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">34</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-21" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">Portugal</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">37</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-22" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span 
class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">Iran</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">43</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-23" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">Haiti</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">44</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-24" class="crayon-line crayon-striped-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337d0635477623-25" class="crayon-line"><span class="crayon-e">only </span><span class="crayon-e">showing </span><span class="crayon-i">top</span> <span class="crayon-cn">20</span> 
<span class="crayon-v">rows</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">The native_country feature has only one household from the Netherlands, so you exclude it.</p> <div id="urvanov-syntax-highlighter-610ff0b7337d4911075986" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337d4911075986-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b7337d4911075986-1" class="crayon-line"><span class="crayon-v">df_remove</span> <span class="crayon-o">=</span> <span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-k ">filter</span><span class="crayon-sy">(</span><span class="crayon-v">df</span><span class="crayon-sy">.</span><span class="crayon-v">native_country</span> <span class="crayon-o">!=</span> <span class="crayon-s">'Holand-Netherlands'</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h4 id="ftoc-heading-21" class="ftwp-heading" style="text-align: justify;">Step 3) Build a data processing pipeline</h4> <p style="text-align: justify;">Like scikit-learn, PySpark has a pipeline API.</p> <p style="text-align: justify;">A pipeline is a convenient way to keep the data's structure intact. You push the data into the pipeline. 
Inside the pipeline, various operations are performed, and the output is fed to the algorithm.</p> <p style="text-align: justify;">For example, a common transformation in machine learning converts a string column into a one-hot encoding, that is, one column per group. A one-hot encoded matrix is usually sparse, consisting mostly of zeros.</p> <p style="text-align: justify;">The steps to transform the data are very similar to scikit-learn. You need to:</p> <ul style="text-align: justify;"> <li>Index the strings as numbers</li> <li>Create a one-hot encoder</li> <li>Transform the data</li> </ul> <p style="text-align: justify;">Two APIs do the job: StringIndexer and OneHotEncoder.</p> <ol style="text-align: justify;"> <li>First, select the string column to index. inputCol is the name of the column in the dataset; outputCol is the new name given to the transformed column.<br /> <div id="urvanov-syntax-highlighter-610ff0b7337d8569852625" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337d8569852625-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b7337d8569852625-1" class="crayon-line"><span class="crayon-v">stringIndexer</span> <span class="crayon-o">=</span> <span class="crayon-e">StringIndexer</span><span class="crayon-sy">(</span><span class="crayon-v">inputCol</span><span class="crayon-o">=</span><span class="crayon-s">"workclass"</span><span 
class="crayon-sy">,</span> <span class="crayon-v">outputCol</span><span class="crayon-o">=</span><span class="crayon-s">"workclass_encoded"</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> </li> <li>Fit the indexer to the data and transform it<br /> <div id="urvanov-syntax-highlighter-610ff0b7337de839477734" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337de839477734-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337de839477734-2">2</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b7337de839477734-1" class="crayon-line"><span class="crayon-v">model</span> <span class="crayon-o">=</span> <span class="crayon-v">stringIndexer</span><span class="crayon-sy">.</span><span class="crayon-e">fit</span><span class="crayon-sy">(</span><span class="crayon-v">df</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337de839477734-2" class="crayon-line crayon-striped-line"><span class="crayon-v">indexed</span> <span class="crayon-o">=</span> <span class="crayon-v">model</span><span class="crayon-sy">.</span><span class="crayon-e">transform</span><span class="crayon-sy">(</span><span class="crayon-v">df</span><span class="crayon-sy">)</span></div> </div> </td> 
</tr> </tbody> </table> </div> </div> </li> <li>Create the new columns based on the categories. For example, if the feature has 10 categories, the new matrix will have 10 columns, one per category.<br /> <div id="urvanov-syntax-highlighter-610ff0b7337e1922462731" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337e1922462731-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b7337e1922462731-1" class="crayon-line"><span class="crayon-e">OneHotEncoder</span><span class="crayon-sy">(</span><span class="crayon-v">dropLast</span><span class="crayon-o">=</span><span class="crayon-t">False</span><span class="crayon-sy">,</span> <span class="crayon-v">inputCol</span><span class="crayon-o">=</span><span class="crayon-s">"workclass_encoded"</span><span class="crayon-sy">,</span> <span class="crayon-v">outputCol</span><span class="crayon-o">=</span><span class="crayon-s">"workclass_vec"</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <br /> <div id="urvanov-syntax-highlighter-610ff0b7337e4230365442" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table
class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337e4230365442-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337e4230365442-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337e4230365442-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337e4230365442-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337e4230365442-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337e4230365442-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337e4230365442-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337e4230365442-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337e4230365442-9">9</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b7337e4230365442-1" class="crayon-line"><span class="crayon-c">### Example encoder</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337e4230365442-2" class="crayon-line crayon-striped-line"><span class="crayon-st">from</span> <span class="crayon-v">pyspark</span><span class="crayon-sy">.</span><span class="crayon-v">ml</span><span class="crayon-sy">.</span><span class="crayon-e">feature </span><span class="crayon-r">import</span> <span class="crayon-v">StringIndexer</span><span class="crayon-sy">,</span> <span class="crayon-v">OneHotEncoder</span><span class="crayon-sy">,</span> <span class="crayon-e">VectorAssembler</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337e4230365442-3" class="crayon-line"></div> <div 
id="urvanov-syntax-highlighter-610ff0b7337e4230365442-4" class="crayon-line crayon-striped-line"><span class="crayon-v">stringIndexer</span> <span class="crayon-o">=</span> <span class="crayon-e">StringIndexer</span><span class="crayon-sy">(</span><span class="crayon-v">inputCol</span><span class="crayon-o">=</span><span class="crayon-s">"workclass"</span><span class="crayon-sy">,</span> <span class="crayon-v">outputCol</span><span class="crayon-o">=</span><span class="crayon-s">"workclass_encoded"</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337e4230365442-5" class="crayon-line"><span class="crayon-v">model</span> <span class="crayon-o">=</span> <span class="crayon-v">stringIndexer</span><span class="crayon-sy">.</span><span class="crayon-e">fit</span><span class="crayon-sy">(</span><span class="crayon-v">df</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337e4230365442-6" class="crayon-line crayon-striped-line"><span class="crayon-v">indexed</span> <span class="crayon-o">=</span> <span class="crayon-v">model</span><span class="crayon-sy">.</span><span class="crayon-e">transform</span><span class="crayon-sy">(</span><span class="crayon-v">df</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337e4230365442-7" class="crayon-line"><span class="crayon-v">encoder</span> <span class="crayon-o">=</span> <span class="crayon-e">OneHotEncoder</span><span class="crayon-sy">(</span><span class="crayon-v">dropLast</span><span class="crayon-o">=</span><span class="crayon-t">False</span><span class="crayon-sy">,</span> <span class="crayon-v">inputCol</span><span class="crayon-o">=</span><span class="crayon-s">"workclass_encoded"</span><span class="crayon-sy">,</span> <span class="crayon-v">outputCol</span><span class="crayon-o">=</span><span class="crayon-s">"workclass_vec"</span><span class="crayon-sy">)</span></div> <div 
id="urvanov-syntax-highlighter-610ff0b7337e4230365442-8" class="crayon-line crayon-striped-line"><span class="crayon-v">encoded</span> <span class="crayon-o">=</span> <span class="crayon-v">encoder</span><span class="crayon-sy">.</span><span class="crayon-e">transform</span><span class="crayon-sy">(</span><span class="crayon-v">indexed</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337e4230365442-9" class="crayon-line"><span class="crayon-v">encoded</span><span class="crayon-sy">.</span><span class="crayon-e">show</span><span class="crayon-sy">(</span><span class="crayon-cn">2</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <br /> <div id="urvanov-syntax-highlighter-610ff0b7337e7805022826" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337e7805022826-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337e7805022826-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337e7805022826-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337e7805022826-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337e7805022826-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337e7805022826-6">6</div> <div class="crayon-num" 
data-line="urvanov-syntax-highlighter-610ff0b7337e7805022826-7">7</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b7337e7805022826-1" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span 
class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span 
class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337e7805022826-2" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-v">age</span><span class="crayon-o">|</span><span class="crayon-v">age_square</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">workclass</span><span class="crayon-o">|</span><span class="crayon-v">fnlwgt</span><span class="crayon-o">|</span><span class="crayon-v">education</span><span class="crayon-o">|</span><span class="crayon-v">education_num</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">marital</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">occupation</span><span class="crayon-o">|</span> <span class="crayon-v">relationship</span><span class="crayon-o">|</span> <span class="crayon-v">race</span><span class="crayon-o">|</span> <span class="crayon-v">sex</span><span class="crayon-o">|</span><span class="crayon-v">capital_gain</span><span class="crayon-o">|</span><span class="crayon-v">capital_loss</span><span class="crayon-o">|</span><span class="crayon-v">hours_week</span><span class="crayon-o">|</span><span class="crayon-v">native_country</span><span class="crayon-o">|</span><span class="crayon-v">label</span><span class="crayon-o">|</span><span class="crayon-v">workclass_encoded</span><span class="crayon-o">|</span><span class="crayon-v">workclass_vec</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337e7805022826-3" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span 
class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span 
class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337e7805022826-4" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span> <span class="crayon-cn">39</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span 
class="crayon-cn">1521.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">State</span><span class="crayon-o">-</span><span class="crayon-v">gov</span><span class="crayon-o">|</span> <span class="crayon-cn">77516</span><span class="crayon-o">|</span><span class="crayon-v">Bachelors</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">13</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">Never</span><span class="crayon-o">-</span><span class="crayon-v">married</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp; </span><span class="crayon-v">Adm</span><span class="crayon-o">-</span><span class="crayon-v">clerical</span><span class="crayon-o">|</span><span class="crayon-st">Not</span><span class="crayon-o">-</span><span class="crayon-st">in</span><span class="crayon-o">-</span><span class="crayon-v">family</span><span class="crayon-o">|</span><span class="crayon-v">White</span><span class="crayon-o">|</span><span class="crayon-v">Male</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">2174</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">40</span><span class="crayon-o">|</span> <span class="crayon-v">United</span><span class="crayon-o">-</span><span class="crayon-v">States</span><span class="crayon-o">|</span><span class="crayon-o">&lt;=</span><span class="crayon-cn">50K</span><span class="crayon-o">|</span><span 
class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">4.0</span><span class="crayon-o">|</span><span class="crayon-sy">(</span><span class="crayon-cn">9</span><span class="crayon-sy">,</span><span class="crayon-sy">[</span><span class="crayon-cn">4</span><span class="crayon-sy">]</span><span class="crayon-sy">,</span><span class="crayon-sy">[</span><span class="crayon-cn">1.0</span><span class="crayon-sy">]</span><span class="crayon-sy">)</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337e7805022826-5" class="crayon-line"><span class="crayon-o">|</span> <span class="crayon-cn">50</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">2500.0</span><span class="crayon-o">|</span><span class="crayon-r">Self</span><span class="crayon-o">-</span><span class="crayon-v">emp</span><span class="crayon-o">-</span><span class="crayon-st">not</span><span class="crayon-o">-</span><span class="crayon-v">inc</span><span class="crayon-o">|</span> <span class="crayon-cn">83311</span><span class="crayon-o">|</span><span class="crayon-v">Bachelors</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">13</span><span class="crayon-o">|</span><span class="crayon-v">Married</span><span class="crayon-o">-</span><span class="crayon-v">civ</span><span class="crayon-o">-</span><span class="crayon-v">spouse</span><span class="crayon-o">|</span><span class="crayon-v">Exec</span><span class="crayon-o">-</span><span class="crayon-v">managerial</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">Husband</span><span class="crayon-o">|</span><span class="crayon-v">White</span><span class="crayon-o">|</span><span class="crayon-v">Male</span><span 
class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">13</span><span class="crayon-o">|</span> <span class="crayon-v">United</span><span class="crayon-o">-</span><span class="crayon-v">States</span><span class="crayon-o">|</span><span class="crayon-o">&lt;=</span><span class="crayon-cn">50K</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">1.0</span><span class="crayon-o">|</span><span class="crayon-sy">(</span><span class="crayon-cn">9</span><span class="crayon-sy">,</span><span class="crayon-sy">[</span><span class="crayon-cn">1</span><span class="crayon-sy">]</span><span class="crayon-sy">,</span><span class="crayon-sy">[</span><span class="crayon-cn">1.0</span><span class="crayon-sy">]</span><span class="crayon-sy">)</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337e7805022826-6" class="crayon-line crayon-striped-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span 
class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span 
class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337e7805022826-7" class="crayon-line"><span class="crayon-e">only </span><span class="crayon-e">showing </span><span class="crayon-i">top</span> <span class="crayon-cn">2</span> <span class="crayon-v">rows</span></div> </div> </td> </tr> </tbody> </table> </div> </div> </li> </ol> <h5 id="ftoc-heading-22" class="ftwp-heading" style="text-align: justify;">Building the pipeline</h5> <p style="text-align: justify;">You will build a pipeline that converts all the features and adds them to the final dataset.
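</p> <p style="text-align: justify;">To make the idea concrete, here is a minimal plain-Python sketch of what a pipeline does: it fits each stage in order and passes the transformed rows on to the next stage. ToyIndexer, ToyOneHot and ToyPipeline are hypothetical stand-ins written for illustration, not Spark classes, and the sample rows are made up. The most-frequent-first index ordering is modelled on StringIndexer's default behaviour; the alphabetical tie-break is an assumption.</p>

```python
# Plain-Python illustration only: these Toy* classes are hypothetical,
# they mimic the fit/transform chaining of a pipeline, not Spark's internals.
from collections import Counter


class ToyIndexer:
    """Maps each distinct string in one column to a numeric index."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = {}

    def fit(self, rows):
        counts = Counter(row[self.input_col] for row in rows)
        # Most frequent value gets 0.0; ties are broken alphabetically (assumption).
        ordered = sorted(counts, key=lambda value: (-counts[value], value))
        self.mapping = {value: float(i) for i, value in enumerate(ordered)}
        return self

    def transform(self, rows):
        return [{**row, self.output_col: self.mapping[row[self.input_col]]} for row in rows]


class ToyOneHot:
    """Turns a numeric index into a sparse (size, [position], [1.0]) triple."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col
        self.size = 0

    def fit(self, rows):
        # One slot per category, comparable to dropLast=False.
        self.size = int(max(row[self.input_col] for row in rows)) + 1
        return self

    def transform(self, rows):
        return [
            {**row, self.output_col: (self.size, [int(row[self.input_col])], [1.0])}
            for row in rows
        ]


class ToyPipeline:
    """Fits each stage in order and feeds its output to the next stage."""

    def __init__(self, stages):
        self.stages = stages

    def fit_transform(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return rows


rows = [{"workclass": "Private"}, {"workclass": "State-gov"}, {"workclass": "Private"}]
stages = [
    ToyIndexer("workclass", "workclassIndex"),
    ToyOneHot("workclassIndex", "workclassVec"),
]
result = ToyPipeline(stages).fit_transform(rows)
print(result[1])  # the "State-gov" row with its index and sparse one-hot triple
```

<p style="text-align: justify;">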
The pipeline will have four operations, but feel free to add as many as you like.</p> <ol style="text-align: justify;"> <li>Encode the categorical data</li> <li>Index the label feature</li> <li>Add the continuous variables</li> <li>Assemble the steps.</li> </ol> <p style="text-align: justify;">Each step is stored in a list named stages. This list tells the Pipeline which operations to perform, and in which order.</p> <p style="text-align: justify;"><strong>Encode the categorical data</strong></p> <p style="text-align: justify;">This step is the same as the example above, except that you loop over all the categorical features.</p> <div id="urvanov-syntax-highlighter-610ff0b7337ec364520384" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337ec364520384-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337ec364520384-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337ec364520384-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337ec364520384-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337ec364520384-5">5</div> <div class="crayon-num crayon-striped-num"
data-line="urvanov-syntax-highlighter-610ff0b7337ec364520384-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337ec364520384-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337ec364520384-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337ec364520384-9">9</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b7337ec364520384-1" class="crayon-line"><span class="crayon-st">from</span> <span class="crayon-v">pyspark</span><span class="crayon-sy">.</span><span class="crayon-e">ml </span><span class="crayon-r">import</span> <span class="crayon-e">Pipeline</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337ec364520384-2" class="crayon-line crayon-striped-line"><span class="crayon-st">from</span> <span class="crayon-v">pyspark</span><span class="crayon-sy">.</span><span class="crayon-v">ml</span><span class="crayon-sy">.</span><span class="crayon-e">feature </span><span class="crayon-r">import</span> <span class="crayon-e">OneHotEncoderEstimator</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337ec364520384-3" class="crayon-line"><span class="crayon-v">CATE_FEATURES</span> <span class="crayon-o">=</span> <span class="crayon-sy">[</span><span class="crayon-s">'workclass'</span><span class="crayon-sy">,</span> <span class="crayon-s">'education'</span><span class="crayon-sy">,</span> <span class="crayon-s">'marital'</span><span class="crayon-sy">,</span> <span class="crayon-s">'occupation'</span><span class="crayon-sy">,</span> <span class="crayon-s">'relationship'</span><span class="crayon-sy">,</span> <span class="crayon-s">'race'</span><span class="crayon-sy">,</span> <span class="crayon-s">'sex'</span><span class="crayon-sy">,</span> <span class="crayon-s">'native_country'</span><span class="crayon-sy">]</span></div> <div 
id="urvanov-syntax-highlighter-610ff0b7337ec364520384-4" class="crayon-line crayon-striped-line"><span class="crayon-v">stages</span> <span class="crayon-o">=</span> <span class="crayon-sy">[</span><span class="crayon-sy">]</span> <span class="crayon-c"># stages in our Pipeline</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337ec364520384-5" class="crayon-line"><span class="crayon-st">for</span> <span class="crayon-e">categoricalCol </span><span class="crayon-st">in</span> <span class="crayon-v">CATE_FEATURES</span><span class="crayon-o">:</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337ec364520384-6" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">stringIndexer</span> <span class="crayon-o">=</span> <span class="crayon-e">StringIndexer</span><span class="crayon-sy">(</span><span class="crayon-v">inputCol</span><span class="crayon-o">=</span><span class="crayon-v">categoricalCol</span><span class="crayon-sy">,</span> <span class="crayon-v">outputCol</span><span class="crayon-o">=</span><span class="crayon-v">categoricalCol</span> <span class="crayon-o">+</span> <span class="crayon-s">"Index"</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337ec364520384-7" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">encoder</span> <span class="crayon-o">=</span> <span class="crayon-e">OneHotEncoderEstimator</span><span class="crayon-sy">(</span><span class="crayon-v">inputCols</span><span class="crayon-o">=</span><span class="crayon-sy">[</span><span class="crayon-v">stringIndexer</span><span class="crayon-sy">.</span><span class="crayon-e">getOutputCol</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span><span class="crayon-sy">]</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337ec364520384-8" class="crayon-line crayon-striped-line"><span 
class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">outputCols</span><span class="crayon-o">=</span><span class="crayon-sy">[</span><span class="crayon-v">categoricalCol</span> <span class="crayon-o">+</span> <span class="crayon-s">"classVec"</span><span class="crayon-sy">]</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337ec364520384-9" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">stages</span> <span class="crayon-o">+=</span> <span class="crayon-sy">[</span><span class="crayon-v">stringIndexer</span><span class="crayon-sy">,</span> <span class="crayon-v">encoder</span><span class="crayon-sy">]</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;"><strong>Index the label feature</strong></p> <p style="text-align: justify;">Spark, like many other libraries, does not accept string values for the label. 
You convert the label feature with StringIndexer and add it to the list of stages.</p> <div id="urvanov-syntax-highlighter-610ff0b7337f1204598441" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337f1204598441-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337f1204598441-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337f1204598441-3">3</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b7337f1204598441-1" class="crayon-line"><span class="crayon-c"># Convert label into label indices using the StringIndexer</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337f1204598441-2" class="crayon-line crayon-striped-line"><span class="crayon-v">label_stringIdx</span> <span class="crayon-o">=</span> <span class="crayon-e">StringIndexer</span><span class="crayon-sy">(</span><span class="crayon-v">inputCol</span><span class="crayon-o">=</span><span class="crayon-s">"label"</span><span class="crayon-sy">,</span> <span class="crayon-v">outputCol</span><span class="crayon-o">=</span><span class="crayon-s">"newlabel"</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337f1204598441-3" class="crayon-line"><span class="crayon-v">stages</span> <span 
class="crayon-o">+=</span> <span class="crayon-sy">[</span><span class="crayon-v">label_stringIdx</span><span class="crayon-sy">]</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;"><strong>Add continuous variables</strong></p> <p style="text-align: justify;">The inputCols parameter of VectorAssembler is a list of columns. You can build this list from the encoded categorical columns (the classVec outputs) and the continuous columns.</p> <div id="urvanov-syntax-highlighter-610ff0b7337f7911004166" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337f7911004166-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b7337f7911004166-1" class="crayon-line"><span class="crayon-v">assemblerInputs</span> <span class="crayon-o">=</span> <span class="crayon-sy">[</span><span class="crayon-v">c</span> <span class="crayon-o">+</span> <span class="crayon-s">"classVec"</span> <span class="crayon-st">for</span> <span class="crayon-i">c</span> <span class="crayon-st">in</span> <span class="crayon-v">CATE_FEATURES</span><span class="crayon-sy">]</span> <span class="crayon-o">+</span> <span class="crayon-v">CONTI_FEATURES</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;"><strong>Assemble the steps</strong></p> <p style="text-align: justify;">Finally, you pass all the input columns to 
VectorAssembler and add the assembler to the stages.</p> <div id="urvanov-syntax-highlighter-610ff0b7337fa294366448" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337fa294366448-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337fa294366448-2">2</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b7337fa294366448-1" class="crayon-line"><span class="crayon-v">assembler</span> <span class="crayon-o">=</span> <span class="crayon-e">VectorAssembler</span><span class="crayon-sy">(</span><span class="crayon-v">inputCols</span><span class="crayon-o">=</span><span class="crayon-v">assemblerInputs</span><span class="crayon-sy">,</span> <span class="crayon-v">outputCol</span><span class="crayon-o">=</span><span class="crayon-s">"features"</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337fa294366448-2" class="crayon-line crayon-striped-line"><span class="crayon-v">stages</span> <span class="crayon-o">+=</span> <span class="crayon-sy">[</span><span class="crayon-v">assembler</span><span class="crayon-sy">]</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">Now that all the steps are ready, you push the data into the pipeline.</p> <div id="urvanov-syntax-highlighter-610ff0b7337fd199661433" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div 
class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337fd199661433-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337fd199661433-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b7337fd199661433-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b7337fd199661433-4">4</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b7337fd199661433-1" class="crayon-line"><span class="crayon-c"># Create a Pipeline.</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337fd199661433-2" class="crayon-line crayon-striped-line"><span class="crayon-v">pipeline</span> <span class="crayon-o">=</span> <span class="crayon-e">Pipeline</span><span class="crayon-sy">(</span><span class="crayon-v">stages</span><span class="crayon-o">=</span><span class="crayon-v">stages</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337fd199661433-3" class="crayon-line"><span class="crayon-v">pipelineModel</span> <span class="crayon-o">=</span> <span class="crayon-v">pipeline</span><span class="crayon-sy">.</span><span class="crayon-e">fit</span><span class="crayon-sy">(</span><span class="crayon-v">df_remove</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b7337fd199661433-4" class="crayon-line crayon-striped-line"><span class="crayon-v">model</span> <span class="crayon-o">=</span> <span class="crayon-v">pipelineModel</span><span class="crayon-sy">.</span><span class="crayon-e">transform</span><span 
class="crayon-sy">(</span><span class="crayon-v">df_remove</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">If you inspect the new dataset, you can see that it contains all the features, both transformed and untransformed. You are only interested in the new label and the features.</p> <div id="urvanov-syntax-highlighter-610ff0b733800961492456" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733800961492456-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733800961492456-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733800961492456-3">3</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733800961492456-1" class="crayon-line"><span class="crayon-v">model</span><span class="crayon-sy">.</span><span class="crayon-e">take</span><span class="crayon-sy">(</span><span class="crayon-cn">1</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733800961492456-2" class="crayon-line crayon-striped-line"></div> <div id="urvanov-syntax-highlighter-610ff0b733800961492456-3" class="crayon-line"><span class="crayon-sy">[</span><span class="crayon-e">Row</span><span class="crayon-sy">(</span><span 
class="crayon-v">age</span><span class="crayon-o">=</span><span class="crayon-cn">39</span><span class="crayon-sy">,</span> <span class="crayon-v">age_square</span><span class="crayon-o">=</span><span class="crayon-cn">1521.0</span><span class="crayon-sy">,</span> <span class="crayon-v">workclass</span><span class="crayon-o">=</span><span class="crayon-s">'State-gov'</span><span class="crayon-sy">,</span> <span class="crayon-v">fnlwgt</span><span class="crayon-o">=</span><span class="crayon-cn">77516</span><span class="crayon-sy">,</span> <span class="crayon-v">education</span><span class="crayon-o">=</span><span class="crayon-s">'Bachelors'</span><span class="crayon-sy">,</span> <span class="crayon-v">education_num</span><span class="crayon-o">=</span><span class="crayon-cn">13</span><span class="crayon-sy">,</span> <span class="crayon-v">marital</span><span class="crayon-o">=</span><span class="crayon-s">'Never-married'</span><span class="crayon-sy">,</span> <span class="crayon-v">occupation</span><span class="crayon-o">=</span><span class="crayon-s">'Adm-clerical'</span><span class="crayon-sy">,</span> <span class="crayon-v">relationship</span><span class="crayon-o">=</span><span class="crayon-s">'Not-in-family'</span><span class="crayon-sy">,</span> <span class="crayon-v">race</span><span class="crayon-o">=</span><span class="crayon-s">'White'</span><span class="crayon-sy">,</span> <span class="crayon-v">sex</span><span class="crayon-o">=</span><span class="crayon-s">'Male'</span><span class="crayon-sy">,</span> <span class="crayon-v">capital_gain</span><span class="crayon-o">=</span><span class="crayon-cn">2174</span><span class="crayon-sy">,</span> <span class="crayon-v">capital_loss</span><span class="crayon-o">=</span><span class="crayon-cn">0</span><span class="crayon-sy">,</span> <span class="crayon-v">hours_week</span><span class="crayon-o">=</span><span class="crayon-cn">40</span><span class="crayon-sy">,</span> <span 
class="crayon-v">native_country</span><span class="crayon-o">=</span><span class="crayon-s">'United-States'</span><span class="crayon-sy">,</span> <span class="crayon-v">label</span><span class="crayon-o">=</span><span class="crayon-s">'&lt;=50K'</span><span class="crayon-sy">,</span> <span class="crayon-v">workclassIndex</span><span class="crayon-o">=</span><span class="crayon-cn">4.0</span><span class="crayon-sy">,</span> <span class="crayon-v">workclassclassVec</span><span class="crayon-o">=</span><span class="crayon-e">SparseVector</span><span class="crayon-sy">(</span><span class="crayon-cn">8</span><span class="crayon-sy">,</span> <span class="crayon-sy">{</span><span class="crayon-cn">4</span><span class="crayon-o">:</span> <span class="crayon-cn">1.0</span><span class="crayon-sy">}</span><span class="crayon-sy">)</span><span class="crayon-sy">,</span> <span class="crayon-v">educationIndex</span><span class="crayon-o">=</span><span class="crayon-cn">2.0</span><span class="crayon-sy">,</span> <span class="crayon-v">educationclassVec</span><span class="crayon-o">=</span><span class="crayon-e">SparseVector</span><span class="crayon-sy">(</span><span class="crayon-cn">15</span><span class="crayon-sy">,</span> <span class="crayon-sy">{</span><span class="crayon-cn">2</span><span class="crayon-o">:</span> <span class="crayon-cn">1.0</span><span class="crayon-sy">}</span><span class="crayon-sy">)</span><span class="crayon-sy">,</span> <span class="crayon-v">maritalIndex</span><span class="crayon-o">=</span><span class="crayon-cn">1.0</span><span class="crayon-sy">,</span> <span class="crayon-v">maritalclassVec</span><span class="crayon-o">=</span><span class="crayon-e">SparseVector</span><span class="crayon-sy">(</span><span class="crayon-cn">6</span><span class="crayon-sy">,</span> <span class="crayon-sy">{</span><span class="crayon-cn">1</span><span class="crayon-o">:</span> <span class="crayon-cn">1.0</span><span class="crayon-sy">}</span><span 
class="crayon-sy">)</span><span class="crayon-sy">,</span> <span class="crayon-v">occupationIndex</span><span class="crayon-o">=</span><span class="crayon-cn">3.0</span><span class="crayon-sy">,</span> <span class="crayon-v">occupationclassVec</span><span class="crayon-o">=</span><span class="crayon-e">SparseVector</span><span class="crayon-sy">(</span><span class="crayon-cn">14</span><span class="crayon-sy">,</span> <span class="crayon-sy">{</span><span class="crayon-cn">3</span><span class="crayon-o">:</span> <span class="crayon-cn">1.0</span><span class="crayon-sy">}</span><span class="crayon-sy">)</span><span class="crayon-sy">,</span> <span class="crayon-v">relationshipIndex</span><span class="crayon-o">=</span><span class="crayon-cn">1.0</span><span class="crayon-sy">,</span> <span class="crayon-v">relationshipclassVec</span><span class="crayon-o">=</span><span class="crayon-e">SparseVector</span><span class="crayon-sy">(</span><span class="crayon-cn">5</span><span class="crayon-sy">,</span> <span class="crayon-sy">{</span><span class="crayon-cn">1</span><span class="crayon-o">:</span> <span class="crayon-cn">1.0</span><span class="crayon-sy">}</span><span class="crayon-sy">)</span><span class="crayon-sy">,</span> <span class="crayon-v">raceIndex</span><span class="crayon-o">=</span><span class="crayon-cn">0.0</span><span class="crayon-sy">,</span> <span class="crayon-v">raceclassVec</span><span class="crayon-o">=</span><span class="crayon-e">SparseVector</span><span class="crayon-sy">(</span><span class="crayon-cn">4</span><span class="crayon-sy">,</span> <span class="crayon-sy">{</span><span class="crayon-cn">0</span><span class="crayon-o">:</span> <span class="crayon-cn">1.0</span><span class="crayon-sy">}</span><span class="crayon-sy">)</span><span class="crayon-sy">,</span> <span class="crayon-v">sexIndex</span><span class="crayon-o">=</span><span class="crayon-cn">0.0</span><span class="crayon-sy">,</span> <span class="crayon-v">sexclassVec</span><span 
class="crayon-o">=</span><span class="crayon-e">SparseVector</span><span class="crayon-sy">(</span><span class="crayon-cn">1</span><span class="crayon-sy">,</span> <span class="crayon-sy">{</span><span class="crayon-cn">0</span><span class="crayon-o">:</span> <span class="crayon-cn">1.0</span><span class="crayon-sy">}</span><span class="crayon-sy">)</span><span class="crayon-sy">,</span> <span class="crayon-v">native_countryIndex</span><span class="crayon-o">=</span><span class="crayon-cn">0.0</span><span class="crayon-sy">,</span> <span class="crayon-v">native_countryclassVec</span><span class="crayon-o">=</span><span class="crayon-e">SparseVector</span><span class="crayon-sy">(</span><span class="crayon-cn">40</span><span class="crayon-sy">,</span> <span class="crayon-sy">{</span><span class="crayon-cn">0</span><span class="crayon-o">:</span> <span class="crayon-cn">1.0</span><span class="crayon-sy">}</span><span class="crayon-sy">)</span><span class="crayon-sy">,</span> <span class="crayon-v">newlabel</span><span class="crayon-o">=</span><span class="crayon-cn">0.0</span><span class="crayon-sy">,</span> <span class="crayon-v">features</span><span class="crayon-o">=</span><span class="crayon-e">SparseVector</span><span class="crayon-sy">(</span><span class="crayon-cn">99</span><span class="crayon-sy">,</span> <span class="crayon-sy">{</span><span class="crayon-cn">4</span><span class="crayon-o">:</span> <span class="crayon-cn">1.0</span><span class="crayon-sy">,</span> <span class="crayon-cn">10</span><span class="crayon-o">:</span> <span class="crayon-cn">1.0</span><span class="crayon-sy">,</span> <span class="crayon-cn">24</span><span class="crayon-o">:</span> <span class="crayon-cn">1.0</span><span class="crayon-sy">,</span> <span class="crayon-cn">32</span><span class="crayon-o">:</span> <span class="crayon-cn">1.0</span><span class="crayon-sy">,</span> <span class="crayon-cn">44</span><span class="crayon-o">:</span> <span class="crayon-cn">1.0</span><span 
class="crayon-sy">,</span> <span class="crayon-cn">48</span><span class="crayon-o">:</span> <span class="crayon-cn">1.0</span><span class="crayon-sy">,</span> <span class="crayon-cn">52</span><span class="crayon-o">:</span> <span class="crayon-cn">1.0</span><span class="crayon-sy">,</span> <span class="crayon-cn">53</span><span class="crayon-o">:</span> <span class="crayon-cn">1.0</span><span class="crayon-sy">,</span> <span class="crayon-cn">93</span><span class="crayon-o">:</span> <span class="crayon-cn">39.0</span><span class="crayon-sy">,</span> <span class="crayon-cn">94</span><span class="crayon-o">:</span> <span class="crayon-cn">77516.0</span><span class="crayon-sy">,</span> <span class="crayon-cn">95</span><span class="crayon-o">:</span> <span class="crayon-cn">2174.0</span><span class="crayon-sy">,</span> <span class="crayon-cn">96</span><span class="crayon-o">:</span> <span class="crayon-cn">13.0</span><span class="crayon-sy">,</span> <span class="crayon-cn">98</span><span class="crayon-o">:</span> <span class="crayon-cn">40.0</span><span class="crayon-sy">}</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span><span class="crayon-sy">]</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h4 id="ftoc-heading-23" class="ftwp-heading" style="text-align: justify;">Step 4) Build the classifier: logistic</h4> <p style="text-align: justify;">To make the computation faster, you convert model to a DataFrame.</p> <p style="text-align: justify;">You need to select the new label and the features from model using map.</p> <div id="urvanov-syntax-highlighter-610ff0b733804366405574" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div 
class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733804366405574-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733804366405574-2">2</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733804366405574-1" class="crayon-line"><span class="crayon-st">from</span> <span class="crayon-v">pyspark</span><span class="crayon-sy">.</span><span class="crayon-v">ml</span><span class="crayon-sy">.</span><span class="crayon-e">linalg </span><span class="crayon-r">import</span> <span class="crayon-e">DenseVector</span></div> <div id="urvanov-syntax-highlighter-610ff0b733804366405574-2" class="crayon-line crayon-striped-line"><span class="crayon-v">input_data</span> <span class="crayon-o">=</span> <span class="crayon-v">model</span><span class="crayon-sy">.</span><span class="crayon-v">rdd</span><span class="crayon-sy">.</span><span class="crayon-k ">map</span><span class="crayon-sy">(</span><span class="crayon-r">lambda</span> <span class="crayon-v">x</span><span class="crayon-o">:</span> <span class="crayon-sy">(</span><span class="crayon-v">x</span><span class="crayon-sy">[</span><span class="crayon-s">"newlabel"</span><span class="crayon-sy">]</span><span class="crayon-sy">,</span> <span class="crayon-e">DenseVector</span><span class="crayon-sy">(</span><span class="crayon-v">x</span><span class="crayon-sy">[</span><span class="crayon-s">"features"</span><span class="crayon-sy">]</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p 
style="text-align: justify;">You are now ready to create the training data as a DataFrame, using sqlContext.</p> <div id="urvanov-syntax-highlighter-610ff0b733808898069554" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733808898069554-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733808898069554-1" class="crayon-line"><span class="crayon-v">df_train</span> <span class="crayon-o">=</span> <span class="crayon-v">sqlContext</span><span class="crayon-sy">.</span><span class="crayon-e">createDataFrame</span><span class="crayon-sy">(</span><span class="crayon-v">input_data</span><span class="crayon-sy">,</span> <span class="crayon-sy">[</span><span class="crayon-s">"label"</span><span class="crayon-sy">,</span> <span class="crayon-s">"features"</span><span class="crayon-sy">]</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">Check the first two rows.</p> <div id="urvanov-syntax-highlighter-610ff0b73380b022007901" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table 
class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73380b022007901-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73380b022007901-1" class="crayon-line"><span class="crayon-v">df_train</span><span class="crayon-sy">.</span><span class="crayon-e">show</span><span class="crayon-sy">(</span><span class="crayon-cn">2</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b73380d940326769" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73380d940326769-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73380d940326769-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73380d940326769-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73380d940326769-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73380d940326769-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73380d940326769-6">6</div> <div class="crayon-num" 
data-line="urvanov-syntax-highlighter-610ff0b73380d940326769-7">7</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73380d940326769-1" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b73380d940326769-2" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-v">label</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">features</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73380d940326769-3" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b73380d940326769-4" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span 
class="crayon-sy">[</span><span class="crayon-cn">0.0</span><span class="crayon-sy">,</span><span class="crayon-cn">0.0</span><span class="crayon-sy">,</span><span class="crayon-cn">0.0</span><span class="crayon-sy">,</span><span class="crayon-cn">0.0</span><span class="crayon-sy">,</span><span class="crayon-sy">.</span><span class="crayon-sy">.</span><span class="crayon-sy">.</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73380d940326769-5" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.0</span><span class="crayon-sy">,</span><span class="crayon-cn">1.0</span><span class="crayon-sy">,</span><span class="crayon-cn">0.0</span><span class="crayon-sy">,</span><span class="crayon-cn">0.0</span><span class="crayon-sy">,</span><span class="crayon-sy">.</span><span class="crayon-sy">.</span><span class="crayon-sy">.</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73380d940326769-6" class="crayon-line crayon-striped-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b73380d940326769-7" class="crayon-line"><span class="crayon-e">only </span><span class="crayon-e">showing </span><span class="crayon-i">top</span> <span class="crayon-cn">2</span> <span class="crayon-v">rows</span></div> </div> </td> </tr> </tbody> </table> </div> 
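<p style="text-align: justify;">The features field of the Row printed earlier is a SparseVector, and the map step wraps it in DenseVector. As a rough illustration of what that expansion amounts to, here is a minimal plain-Python sketch. It is not Spark's implementation, and the helper name sparse_to_dense is hypothetical; the size and the non-zero entries are copied from the Row output shown above.</p>

```python
def sparse_to_dense(size, entries):
    """Expand a {index: value} mapping into a list of `size` floats."""
    dense = [0.0] * size
    for index, value in entries.items():
        if not 0 <= index < size:
            raise IndexError(f"index {index} out of range for size {size}")
        dense[index] = float(value)
    return dense


# Size and non-zero entries copied from the `features` field of the Row above.
features = sparse_to_dense(99, {
    4: 1.0, 10: 1.0, 24: 1.0, 32: 1.0, 44: 1.0, 48: 1.0, 52: 1.0, 53: 1.0,
    93: 39.0, 94: 77516.0, 95: 2174.0, 96: 13.0, 98: 40.0,
})

print(len(features))  # length of the expanded vector
print(features[:5])   # first five slots of the expanded vector
```

<p style="text-align: justify;">In the pipeline itself this role is played by DenseVector; the sketch only makes the layout of the assembled vector easier to see.</p>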
</div> <p style="text-align: justify;"><strong>Create the train/test sets</strong></p> <p style="text-align: justify;">You split the dataset into 80% training data and 20% test data with randomSplit.</p> <div id="urvanov-syntax-highlighter-610ff0b733811116293522" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733811116293522-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733811116293522-2">2</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733811116293522-1" class="crayon-line"><span class="crayon-c"># Split the data into train and test sets</span></div> <div id="urvanov-syntax-highlighter-610ff0b733811116293522-2" class="crayon-line crayon-striped-line"><span class="crayon-v">train_data</span><span class="crayon-sy">,</span> <span class="crayon-v">test_data</span> <span class="crayon-o">=</span> <span class="crayon-v">df_train</span><span class="crayon-sy">.</span><span class="crayon-e">randomSplit</span><span class="crayon-sy">(</span><span class="crayon-sy">[</span><span class="crayon-cn">0.8</span><span class="crayon-sy">,</span> <span class="crayon-cn">0.2</span><span class="crayon-sy">]</span><span class="crayon-sy">,</span> <span class="crayon-v">seed</span><span class="crayon-o">=</span><span class="crayon-cn">1234</span><span class="crayon-sy">)</span></div> </div> </td> </tr> 
</tbody> </table> </div> </div> <p style="text-align: justify;">Count how many people have an income below/above 50k in both the training set and the test set.</p> <div id="urvanov-syntax-highlighter-610ff0b733818424888650" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733818424888650-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733818424888650-1" class="crayon-line"><span class="crayon-v">train_data</span><span class="crayon-sy">.</span><span class="crayon-e">groupby</span><span class="crayon-sy">(</span><span class="crayon-s">'label'</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">agg</span><span class="crayon-sy">(</span><span class="crayon-sy">{</span><span class="crayon-s">'label'</span><span class="crayon-o">:</span> <span class="crayon-s">'count'</span><span class="crayon-sy">}</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">show</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b73381b890762945" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" 
data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73381b890762945-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73381b890762945-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73381b890762945-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73381b890762945-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73381b890762945-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73381b890762945-6">6</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73381b890762945-1" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b73381b890762945-2" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-v">label</span><span class="crayon-o">|</span><span class="crayon-e">count</span><span class="crayon-sy">(</span><span class="crayon-v">label</span><span class="crayon-sy">)</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73381b890762945-3" class="crayon-line"><span 
class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b73381b890762945-4" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">19698</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73381b890762945-5" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">1.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">6263</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73381b890762945-6" class="crayon-line crayon-striped-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b73381e348499142" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div 
class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73381e348499142-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73381e348499142-1" class="crayon-line"><span class="crayon-v">test_data</span><span class="crayon-sy">.</span><span class="crayon-e">groupby</span><span class="crayon-sy">(</span><span class="crayon-s">'label'</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">agg</span><span class="crayon-sy">(</span><span class="crayon-sy">{</span><span class="crayon-s">'label'</span><span class="crayon-o">:</span> <span class="crayon-s">'count'</span><span class="crayon-sy">}</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">show</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b733821120498642" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733821120498642-1">1</div> <div class="crayon-num crayon-striped-num" 
data-line="urvanov-syntax-highlighter-610ff0b733821120498642-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733821120498642-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733821120498642-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733821120498642-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733821120498642-6">6</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733821120498642-1" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b733821120498642-2" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-v">label</span><span class="crayon-o">|</span><span class="crayon-e">count</span><span class="crayon-sy">(</span><span class="crayon-v">label</span><span class="crayon-sy">)</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733821120498642-3" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b733821120498642-4" class="crayon-line crayon-striped-line"><span 
class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">5021</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733821120498642-5" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">1.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">1578</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733821120498642-6" class="crayon-line crayon-striped-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h5 id="ftoc-heading-24" class="ftwp-heading" style="text-align: justify;">Build the logistic regression classifier</h5> <p style="text-align: justify;">Last but not least, you can build the classifier. PySpark provides an API called LogisticRegression to perform logistic regression.</p> <p style="text-align: justify;">You initialize lr by specifying the label column and the features column. Set a maximum of 10 iterations and add a regularization parameter with a value of 0.3. 
Note that in the next section you will use cross-validation with a parameter grid to tune the model.</p> <div id="urvanov-syntax-highlighter-610ff0b733824367014931" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733824367014931-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733824367014931-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733824367014931-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733824367014931-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733824367014931-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733824367014931-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733824367014931-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733824367014931-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733824367014931-9">9</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733824367014931-10">10</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733824367014931-11">11</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div 
id="urvanov-syntax-highlighter-610ff0b733824367014931-1" class="crayon-line"><span class="crayon-c"># Import `LogisticRegression`</span></div> <div id="urvanov-syntax-highlighter-610ff0b733824367014931-2" class="crayon-line crayon-striped-line"><span class="crayon-st">from</span> <span class="crayon-v">pyspark</span><span class="crayon-sy">.</span><span class="crayon-v">ml</span><span class="crayon-sy">.</span><span class="crayon-e">classification </span><span class="crayon-r">import</span> <span class="crayon-i">LogisticRegression</span></div> <div id="urvanov-syntax-highlighter-610ff0b733824367014931-3" class="crayon-line"></div> <div id="urvanov-syntax-highlighter-610ff0b733824367014931-4" class="crayon-line crayon-striped-line"><span class="crayon-c"># Initialize `lr`</span></div> <div id="urvanov-syntax-highlighter-610ff0b733824367014931-5" class="crayon-line"><span class="crayon-v">lr</span> <span class="crayon-o">=</span> <span class="crayon-e">LogisticRegression</span><span class="crayon-sy">(</span><span class="crayon-v">labelCol</span><span class="crayon-o">=</span><span class="crayon-s">"label"</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b733824367014931-6" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">featuresCol</span><span class="crayon-o">=</span><span class="crayon-s">"features"</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b733824367014931-7" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">maxIter</span><span class="crayon-o">=</span><span class="crayon-cn">10</span><span class="crayon-sy">,</span></div> <div 
id="urvanov-syntax-highlighter-610ff0b733824367014931-8" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">regParam</span><span class="crayon-o">=</span><span class="crayon-cn">0.3</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733824367014931-9" class="crayon-line"></div> <div id="urvanov-syntax-highlighter-610ff0b733824367014931-10" class="crayon-line crayon-striped-line"><span class="crayon-c"># Fit the data to the model</span></div> <div id="urvanov-syntax-highlighter-610ff0b733824367014931-11" class="crayon-line"><span class="crayon-v">linearModel</span> <span class="crayon-o">=</span> <span class="crayon-v">lr</span><span class="crayon-sy">.</span><span class="crayon-e">fit</span><span class="crayon-sy">(</span><span class="crayon-v">train_data</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">You can inspect the coefficients and the intercept of the fitted regression:</p> <div id="urvanov-syntax-highlighter-610ff0b733827441827468" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733827441827468-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733827441827468-2">2</div> <div class="crayon-num" 
data-line="urvanov-syntax-highlighter-610ff0b733827441827468-3">3</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733827441827468-1" class="crayon-line"><span class="crayon-c"># Print the coefficients and intercept for logistic regression</span></div> <div id="urvanov-syntax-highlighter-610ff0b733827441827468-2" class="crayon-line crayon-striped-line"><span class="crayon-k ">print</span><span class="crayon-sy">(</span><span class="crayon-s">"Coefficients: "</span> <span class="crayon-o">+</span> <span class="crayon-k ">str</span><span class="crayon-sy">(</span><span class="crayon-v">linearModel</span><span class="crayon-sy">.</span><span class="crayon-v">coefficients</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733827441827468-3" class="crayon-line"><span class="crayon-k ">print</span><span class="crayon-sy">(</span><span class="crayon-s">"Intercept: "</span> <span class="crayon-o">+</span> <span class="crayon-k ">str</span><span class="crayon-sy">(</span><span class="crayon-v">linearModel</span><span class="crayon-sy">.</span><span class="crayon-v">intercept</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b73382a435562337" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" 
data-line="urvanov-syntax-highlighter-610ff0b73382a435562337-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73382a435562337-2">2</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73382a435562337-1" class="crayon-line"><span class="crayon-v">Coefficients</span><span class="crayon-o">:</span> <span class="crayon-sy">[</span><span class="crayon-o">-</span><span class="crayon-cn">0.0678914665262</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.153425526813</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.0706009536407</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.164057586562</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.120655298528</span><span class="crayon-sy">,</span><span class="crayon-cn">0.162922330862</span><span class="crayon-sy">,</span><span class="crayon-cn">0.149176870438</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.626836362611</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.193483661541</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.0782269980838</span><span class="crayon-sy">,</span><span class="crayon-cn">0.222667203836</span><span class="crayon-sy">,</span><span class="crayon-cn">0.399571096381</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.0222024341804</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.311925857859</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.0434497788688</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span 
class="crayon-cn">0.306007744328</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.41318209688</span><span class="crayon-sy">,</span><span class="crayon-cn">0.547937504247</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.395837350854</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.23166535958</span><span class="crayon-sy">,</span><span class="crayon-cn">0.618743906733</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.344088614546</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.385266881369</span><span class="crayon-sy">,</span><span class="crayon-cn">0.317324463006</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.350518889186</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.201335923138</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.232878560088</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.13349278865</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.119760542498</span><span class="crayon-sy">,</span><span class="crayon-cn">0.17500602491</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.0480968101118</span><span class="crayon-sy">,</span><span class="crayon-cn">0.288484253943</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.116314616745</span><span class="crayon-sy">,</span><span class="crayon-cn">0.0524163478063</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.300952624551</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span 
class="crayon-cn">0.22046421474</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.16557996579</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.114676231939</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.311966431453</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.344226119233</span><span class="crayon-sy">,</span><span class="crayon-cn">0.105530129507</span><span class="crayon-sy">,</span><span class="crayon-cn">0.152243047814</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.292774545497</span><span class="crayon-sy">,</span><span class="crayon-cn">0.263628334433</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.199951374076</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.30329422583</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.231087515178</span><span class="crayon-sy">,</span><span class="crayon-cn">0.418918551</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.0565930184279</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.177818073048</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.0733236680663</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.267972912252</span><span class="crayon-sy">,</span><span class="crayon-cn">0.168491215697</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.12181255723</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.385648075442</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span 
class="crayon-cn">0.202101794517</span><span class="crayon-sy">,</span><span class="crayon-cn">0.0469791640782</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.00842850210625</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.00373211448629</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.259296141281</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.309896554133</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.168434409756</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.11048086026</span><span class="crayon-sy">,</span><span class="crayon-cn">0.0280647963877</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.204187030092</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.414392623536</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.252806580669</span><span class="crayon-sy">,</span><span class="crayon-cn">0.143366465705</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.516359222663</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.435627370849</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.301949286524</span><span class="crayon-sy">,</span><span class="crayon-cn">0.0878249035894</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.210951740965</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.621417928742</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.099445190784</span><span 
class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.232671473401</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.1077745606</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.360429419703</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.420362959052</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.379729467809</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.395186242741</span><span class="crayon-sy">,</span><span class="crayon-cn">0.0826401853838</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.280251589972</span><span class="crayon-sy">,</span><span class="crayon-cn">0.187313505214</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.20295228799</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.431177064626</span><span class="crayon-sy">,</span><span class="crayon-cn">0.149759018379</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.107114299614</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.319314858424</span><span class="crayon-sy">,</span><span class="crayon-cn">0.0028450133235</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.651220387649</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.327918792207</span><span class="crayon-sy">,</span><span class="crayon-o">-</span><span class="crayon-cn">0.143659581445</span><span class="crayon-sy">,</span><span class="crayon-cn">0.00691075160413</span><span class="crayon-sy">,</span><span class="crayon-cn">8.38517628783e</span><span 
class="crayon-o">-</span><span class="crayon-cn">08</span><span class="crayon-sy">,</span><span class="crayon-cn">2.18856717378e</span><span class="crayon-o">-</span><span class="crayon-cn">05</span><span class="crayon-sy">,</span><span class="crayon-cn">0.0266701216268</span><span class="crayon-sy">,</span><span class="crayon-cn">0.000231075966823</span><span class="crayon-sy">,</span><span class="crayon-cn">0.00893832698698</span><span class="crayon-sy">]</span></div> <div id="urvanov-syntax-highlighter-610ff0b73382a435562337-2" class="crayon-line crayon-striped-line"><span class="crayon-v">Intercept</span><span class="crayon-o">:</span> <span class="crayon-o">-</span><span class="crayon-cn">1.9884177974805692</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h4 id="ftoc-heading-25" class="ftwp-heading" style="text-align: justify;">Step 5) Train and evaluate the model</h4> <p style="text-align: justify;">To generate predictions for your test set, 
you can use linearModel with transform() on test_data. You then need to look at an accuracy metric to see how well (or poorly) the model performs.</p> <div id="urvanov-syntax-highlighter-610ff0b73382f155201332" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73382f155201332-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73382f155201332-2">2</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73382f155201332-1" class="crayon-line"><span class="crayon-c"># Make predictions on test data using the transform() method.</span></div> <div id="urvanov-syntax-highlighter-610ff0b73382f155201332-2" class="crayon-line crayon-striped-line"><span class="crayon-v">predictions</span> <span class="crayon-o">=</span> <span class="crayon-v">linearModel</span><span class="crayon-sy">.</span><span class="crayon-e">transform</span><span class="crayon-sy">(</span><span class="crayon-v">test_data</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">You can print the schema of the predictions DataFrame:</p> <div id="urvanov-syntax-highlighter-610ff0b733835780500099" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div 
class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733835780500099-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733835780500099-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733835780500099-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733835780500099-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733835780500099-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733835780500099-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733835780500099-7">7</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733835780500099-1" class="crayon-line"><span class="crayon-v">predictions</span><span class="crayon-sy">.</span><span class="crayon-e">printSchema</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733835780500099-2" class="crayon-line crayon-striped-line"><span class="crayon-v">root</span></div> <div id="urvanov-syntax-highlighter-610ff0b733835780500099-3" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">label</span><span class="crayon-o">:</span> <span class="crayon-e">double</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733835780500099-4" 
class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">features</span><span class="crayon-o">:</span> <span class="crayon-e">vector</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733835780500099-5" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">rawPrediction</span><span class="crayon-o">:</span> <span class="crayon-e">vector</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733835780500099-6" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">probability</span><span class="crayon-o">:</span> <span class="crayon-e">vector</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">true</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733835780500099-7" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-o">--</span> <span class="crayon-v">prediction</span><span class="crayon-o">:</span> <span class="crayon-e">double</span> <span class="crayon-sy">(</span><span class="crayon-v">nullable</span> <span class="crayon-o">=</span> <span class="crayon-t">false</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">The columns of interest are the label, the prediction, and the probability</p> <div id="urvanov-syntax-highlighter-610ff0b733838735841494" class="urvanov-syntax-highlighter-syntax crayon-theme-classic 
urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733838735841494-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733838735841494-2">2</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733838735841494-1" class="crayon-line"><span class="crayon-v">selected</span> <span class="crayon-o">=</span> <span class="crayon-v">predictions</span><span class="crayon-sy">.</span><span class="crayon-k ">select</span><span class="crayon-sy">(</span><span class="crayon-s">"label"</span><span class="crayon-sy">,</span> <span class="crayon-s">"prediction"</span><span class="crayon-sy">,</span> <span class="crayon-s">"probability"</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733838735841494-2" class="crayon-line crayon-striped-line"><span class="crayon-v">selected</span><span class="crayon-sy">.</span><span class="crayon-e">show</span><span class="crayon-sy">(</span><span class="crayon-cn">20</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div 
class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-9">9</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-10">10</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-11">11</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-12">12</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-13">13</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-14">14</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-15">15</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-16">16</div> <div class="crayon-num" 
data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-17">17</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-18">18</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-19">19</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-20">20</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-21">21</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-22">22</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-23">23</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-24">24</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73383b497053896-25">25</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-1" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-2" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-v">label</span><span 
class="crayon-o">|</span><span class="crayon-v">prediction</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-v">probability</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-3" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-4" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.91560704124179...</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-5" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.92812140213994...</span><span 
class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-6" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.92161406774159...</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-7" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.96222760777142...</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-8" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.66363283056957...</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-9" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.65571324475477...</span><span class="crayon-o">|</span></div> <div 
id="urvanov-syntax-highlighter-610ff0b73383b497053896-10" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.73053376932829...</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-11" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">1.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.31265053873570...</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-12" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.80005907577390...</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-13" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.76482251301640...</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-14" 
class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.84447301189069...</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-15" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.75691912026619...</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-16" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.60902504096722...</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-17" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.80799228385509...</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-18" class="crayon-line crayon-striped-line"><span 
class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.87704364852567...</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-19" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.83817652582377...</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-20" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.79655423248500...</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-21" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.82712311232246...</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-22" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span 
class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.81372823882016...</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-23" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-sy">[</span><span class="crayon-cn">0.59687710752201...</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-24" class="crayon-line crayon-striped-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b73383b497053896-25" class="crayon-line"><span class="crayon-e">only </span><span class="crayon-e">showing </span><span class="crayon-i">top</span> <span class="crayon-cn">20</span> <span class="crayon-v">rows</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h4 id="ftoc-heading-26" class="ftwp-heading" 
style="text-align: justify;">Evaluate the model</h4> <p style="text-align: justify;">You need an accuracy metric to see how well (or badly) the model performs. Spark's BinaryClassificationEvaluator does not report accuracy; its default metric is the area under the ROC (receiver operating characteristic) curve.</p> <p style="text-align: justify;">Before looking at the ROC, build the accuracy measure by hand. Accuracy is the number of correct predictions divided by the total number of observations.</p> <p style="text-align: justify;">Create a DataFrame with the label and the prediction</p> <div id="urvanov-syntax-highlighter-610ff0b733840043576908" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733840043576908-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733840043576908-1" class="crayon-line"><span class="crayon-v">cm</span> <span class="crayon-o">=</span> <span class="crayon-v">predictions</span><span class="crayon-sy">.</span><span class="crayon-k ">select</span><span class="crayon-sy">(</span><span class="crayon-s">"label"</span><span class="crayon-sy">,</span> <span class="crayon-s">"prediction"</span><span class="crayon-sy">)</span></div> </div> </td> 
</tr> </tbody> </table> </div> </div> <p style="text-align: justify;">You can check how many rows fall into each class, first for the label and then for the prediction</p> <div id="urvanov-syntax-highlighter-610ff0b733843033858472" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733843033858472-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733843033858472-1" class="crayon-line"><span class="crayon-v">cm</span><span class="crayon-sy">.</span><span class="crayon-e">groupby</span><span class="crayon-sy">(</span><span class="crayon-s">'label'</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">agg</span><span class="crayon-sy">(</span><span class="crayon-sy">{</span><span class="crayon-s">'label'</span><span class="crayon-o">:</span> <span class="crayon-s">'count'</span><span class="crayon-sy">}</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">show</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b733846950376645" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div 
class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733846950376645-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733846950376645-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733846950376645-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733846950376645-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733846950376645-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733846950376645-6">6</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733846950376645-1" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b733846950376645-2" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-v">label</span><span class="crayon-o">|</span><span class="crayon-e">count</span><span class="crayon-sy">(</span><span class="crayon-v">label</span><span class="crayon-sy">)</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733846950376645-3" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span 
class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b733846950376645-4" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">5021</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733846950376645-5" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;</span><span class="crayon-cn">1.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">1578</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b733846950376645-6" class="crayon-line crayon-striped-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b733849443917684" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div 
class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733849443917684-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733849443917684-1" class="crayon-line"><span class="crayon-v">cm</span><span class="crayon-sy">.</span><span class="crayon-e">groupby</span><span class="crayon-sy">(</span><span class="crayon-s">'prediction'</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">agg</span><span class="crayon-sy">(</span><span class="crayon-sy">{</span><span class="crayon-s">'prediction'</span><span class="crayon-o">:</span> <span class="crayon-s">'count'</span><span class="crayon-sy">}</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">show</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b73384c390870335" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73384c390870335-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73384c390870335-2">2</div> <div class="crayon-num" 
data-line="urvanov-syntax-highlighter-610ff0b73384c390870335-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73384c390870335-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73384c390870335-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73384c390870335-6">6</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73384c390870335-1" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b73384c390870335-2" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-v">prediction</span><span class="crayon-o">|</span><span class="crayon-e">count</span><span class="crayon-sy">(</span><span class="crayon-v">prediction</span><span class="crayon-sy">)</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73384c390870335-3" class="crayon-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span 
class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span></div> <div id="urvanov-syntax-highlighter-610ff0b73384c390870335-4" class="crayon-line crayon-striped-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">0.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">5982</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73384c390870335-5" class="crayon-line"><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-cn">1.0</span><span class="crayon-o">|</span><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-cn">617</span><span class="crayon-o">|</span></div> <div id="urvanov-syntax-highlighter-610ff0b73384c390870335-6" class="crayon-line crayon-striped-line"><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">+</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">--</span><span class="crayon-o">-</span><span class="crayon-o">+</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">For example, the test set contains 1578 households with an income above 50k and 5021 below.
The classifier, however, predicts 617 households with an income above 50k.</p> <p style="text-align: justify;">You can compute the accuracy by dividing the number of rows whose label is classified correctly by the total number of rows.</p> <div id="urvanov-syntax-highlighter-610ff0b73384f035428977" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73384f035428977-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73384f035428977-1" class="crayon-line"><span class="crayon-v">cm</span><span class="crayon-sy">.</span><span class="crayon-k ">filter</span><span class="crayon-sy">(</span><span class="crayon-v">cm</span><span class="crayon-sy">.</span><span class="crayon-v">label</span> <span class="crayon-o">==</span> <span class="crayon-v">cm</span><span class="crayon-sy">.</span><span class="crayon-v">prediction</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">count</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span> <span class="crayon-o">/</span> <span class="crayon-v">cm</span><span class="crayon-sy">.</span><span class="crayon-e">count</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div
id="urvanov-syntax-highlighter-610ff0b733853248419083" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733853248419083-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733853248419083-1" class="crayon-line"><span class="crayon-cn">0.8237611759357478</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">You can wrap everything together and write a function that computes the accuracy.</p> <div id="urvanov-syntax-highlighter-610ff0b733856420023829" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733856420023829-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733856420023829-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733856420023829-3">3</div> <div
class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733856420023829-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733856420023829-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733856420023829-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733856420023829-7">7</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733856420023829-1" class="crayon-line"><span class="crayon-r">def</span> <span class="crayon-e">accuracy_m</span><span class="crayon-sy">(</span><span class="crayon-v">model</span><span class="crayon-sy">)</span><span class="crayon-o">:</span></div> <div id="urvanov-syntax-highlighter-610ff0b733856420023829-2" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">predictions</span> <span class="crayon-o">=</span> <span class="crayon-v">model</span><span class="crayon-sy">.</span><span class="crayon-e">transform</span><span class="crayon-sy">(</span><span class="crayon-v">test_data</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733856420023829-3" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">cm</span> <span class="crayon-o">=</span> <span class="crayon-v">predictions</span><span class="crayon-sy">.</span><span class="crayon-k ">select</span><span class="crayon-sy">(</span><span class="crayon-s">"label"</span><span class="crayon-sy">,</span> <span class="crayon-s">"prediction"</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733856420023829-4" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">acc</span> <span class="crayon-o">=</span> <span class="crayon-v">cm</span><span 
class="crayon-sy">.</span><span class="crayon-k ">filter</span><span class="crayon-sy">(</span><span class="crayon-v">cm</span><span class="crayon-sy">.</span><span class="crayon-v">label</span> <span class="crayon-o">==</span> <span class="crayon-v">cm</span><span class="crayon-sy">.</span><span class="crayon-v">prediction</span><span class="crayon-sy">)</span><span class="crayon-sy">.</span><span class="crayon-e">count</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span> <span class="crayon-o">/</span> <span class="crayon-v">cm</span><span class="crayon-sy">.</span><span class="crayon-e">count</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733856420023829-5" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-k ">print</span><span class="crayon-sy">(</span><span class="crayon-s">"Model accuracy: %.3f%%"</span> <span class="crayon-o">%</span> <span class="crayon-sy">(</span><span class="crayon-v">acc</span> <span class="crayon-o">*</span> <span class="crayon-cn">100</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733856420023829-6" class="crayon-line crayon-striped-line"><span class="crayon-e">accuracy_m</span><span class="crayon-sy">(</span><span class="crayon-v">model</span> <span class="crayon-o">=</span> <span class="crayon-v">linearModel</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733856420023829-7" class="crayon-line"><span class="crayon-c"># Output: Model accuracy: 82.376%</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h5 id="ftoc-heading-27" class="ftwp-heading" style="text-align: justify;">ROC metrics</h5> <p style="text-align: justify;">The BinaryClassificationEvaluator module
includes ROC metrics. The Receiver Operating Characteristic (ROC) curve is another common tool used with binary classification. It is very similar to the precision/recall curve, but instead of plotting precision against recall, the ROC curve plots the true positive rate (i.e. recall) against the false positive rate. The false positive rate is the proportion of negative instances that are incorrectly classified as positive. The true negative rate is also called specificity. Hence, the ROC curve plots sensitivity (recall) against 1 &minus; specificity.</p> <div id="urvanov-syntax-highlighter-610ff0b73385d252460488" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73385d252460488-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73385d252460488-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73385d252460488-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73385d252460488-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73385d252460488-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73385d252460488-6">6</div> <div class="crayon-num"
data-line="urvanov-syntax-highlighter-610ff0b73385d252460488-7">7</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73385d252460488-1" class="crayon-line"><span class="crayon-c">### Use ROC </span></div> <div id="urvanov-syntax-highlighter-610ff0b73385d252460488-2" class="crayon-line crayon-striped-line"><span class="crayon-st">from</span> <span class="crayon-v">pyspark</span><span class="crayon-sy">.</span><span class="crayon-v">ml</span><span class="crayon-sy">.</span><span class="crayon-e">evaluation </span><span class="crayon-r">import</span> <span class="crayon-i">BinaryClassificationEvaluator</span></div> <div id="urvanov-syntax-highlighter-610ff0b73385d252460488-3" class="crayon-line"></div> <div id="urvanov-syntax-highlighter-610ff0b73385d252460488-4" class="crayon-line crayon-striped-line"><span class="crayon-c"># Evaluate model</span></div> <div id="urvanov-syntax-highlighter-610ff0b73385d252460488-5" class="crayon-line"><span class="crayon-v">evaluator</span> <span class="crayon-o">=</span> <span class="crayon-e">BinaryClassificationEvaluator</span><span class="crayon-sy">(</span><span class="crayon-v">rawPredictionCol</span><span class="crayon-o">=</span><span class="crayon-s">"rawPrediction"</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b73385d252460488-6" class="crayon-line crayon-striped-line"><span class="crayon-k ">print</span><span class="crayon-sy">(</span><span class="crayon-v">evaluator</span><span class="crayon-sy">.</span><span class="crayon-e">evaluate</span><span class="crayon-sy">(</span><span class="crayon-v">predictions</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b73385d252460488-7" class="crayon-line"><span class="crayon-k ">print</span><span class="crayon-sy">(</span><span class="crayon-v">evaluator</span><span 
class="crayon-sy">.</span><span class="crayon-e">getMetricName</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b733860459616257" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733860459616257-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733860459616257-2">2</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733860459616257-1" class="crayon-line"><span class="crayon-cn">0.8940481662695192</span></div> <div id="urvanov-syntax-highlighter-610ff0b733860459616257-2" class="crayon-line crayon-striped-line"><span class="crayon-v">areaUnderROC</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b733863562432713" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733863562432713-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div
id="urvanov-syntax-highlighter-610ff0b733863562432713-1" class="crayon-line"><span class="crayon-k ">print</span><span class="crayon-sy">(</span><span class="crayon-v">evaluator</span><span class="crayon-sy">.</span><span class="crayon-e">evaluate</span><span class="crayon-sy">(</span><span class="crayon-v">predictions</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b733865242315384" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733865242315384-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733865242315384-1" class="crayon-line"><span class="crayon-cn">0.8940481662695192</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h4 id="ftoc-heading-28" class="ftwp-heading" style="text-align: justify;">Step 6) Tune the hyperparameters</h4> <p style="text-align: justify;">Last but not least, you can tune the hyperparameters.</p> <p style="text-align: justify;">To reduce computation time, you tune only the regularization parameter, and with only two values.</p> <div id="urvanov-syntax-highlighter-610ff0b733868419269978" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco
urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733868419269978-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733868419269978-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733868419269978-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733868419269978-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733868419269978-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733868419269978-6">6</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733868419269978-1" class="crayon-line"><span class="crayon-st">from</span> <span class="crayon-v">pyspark</span><span class="crayon-sy">.</span><span class="crayon-v">ml</span><span class="crayon-sy">.</span><span class="crayon-e">tuning </span><span class="crayon-r">import</span> <span class="crayon-v">ParamGridBuilder</span><span class="crayon-sy">,</span> <span class="crayon-i">CrossValidator</span></div> <div id="urvanov-syntax-highlighter-610ff0b733868419269978-2" class="crayon-line crayon-striped-line"></div> <div id="urvanov-syntax-highlighter-610ff0b733868419269978-3" class="crayon-line"><span class="crayon-c"># Create ParamGrid for Cross Validation</span></div> <div id="urvanov-syntax-highlighter-610ff0b733868419269978-4" class="crayon-line crayon-striped-line"><span 
class="crayon-v">paramGrid</span> <span class="crayon-o">=</span> <span class="crayon-sy">(</span><span class="crayon-e">ParamGridBuilder</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733868419269978-5" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-sy">.</span><span class="crayon-e">addGrid</span><span class="crayon-sy">(</span><span class="crayon-v">lr</span><span class="crayon-sy">.</span><span class="crayon-v">regParam</span><span class="crayon-sy">,</span> <span class="crayon-sy">[</span><span class="crayon-cn">0.01</span><span class="crayon-sy">,</span> <span class="crayon-cn">0.5</span><span class="crayon-sy">]</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b733868419269978-6" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span class="crayon-sy">.</span><span class="crayon-e">build</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">Finally, you evaluate the model using cross-validation.</p> <div id="urvanov-syntax-highlighter-610ff0b73386c282038553" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div
class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73386c282038553-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73386c282038553-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73386c282038553-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73386c282038553-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73386c282038553-5">5</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73386c282038553-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73386c282038553-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73386c282038553-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73386c282038553-9">9</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73386c282038553-10">10</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73386c282038553-11">11</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73386c282038553-12">12</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73386c282038553-13">13</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73386c282038553-14">14</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73386c282038553-1" class="crayon-line"><span class="crayon-st">from</span> <span class="crayon-k ">time</span> <span class="crayon-r">import</span> <span class="crayon-k ">time</span></div> <div id="urvanov-syntax-highlighter-610ff0b73386c282038553-2" class="crayon-line crayon-striped-line"><span class="crayon-v">start_time</span> <span
class="crayon-o">=</span> <span class="crayon-k ">time</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b73386c282038553-3" class="crayon-line"></div> <div id="urvanov-syntax-highlighter-610ff0b73386c282038553-4" class="crayon-line crayon-striped-line"><span class="crayon-c"># Create 5-fold CrossValidator</span></div> <div id="urvanov-syntax-highlighter-610ff0b73386c282038553-5" class="crayon-line"><span class="crayon-v">cv</span> <span class="crayon-o">=</span> <span class="crayon-e">CrossValidator</span><span class="crayon-sy">(</span><span class="crayon-v">estimator</span><span class="crayon-o">=</span><span class="crayon-v">lr</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b73386c282038553-6" class="crayon-line crayon-striped-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">estimatorParamMaps</span><span class="crayon-o">=</span><span class="crayon-v">paramGrid</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b73386c282038553-7" class="crayon-line"><span class="crayon-h">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="crayon-v">evaluator</span><span class="crayon-o">=</span><span class="crayon-v">evaluator</span><span class="crayon-sy">,</span> <span class="crayon-v">numFolds</span><span class="crayon-o">=</span><span class="crayon-cn">5</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b73386c282038553-8" class="crayon-line crayon-striped-line"></div> <div id="urvanov-syntax-highlighter-610ff0b73386c282038553-9" class="crayon-line"><span class="crayon-c"># Run cross validations</span></div> <div id="urvanov-syntax-highlighter-610ff0b73386c282038553-10" 
class="crayon-line crayon-striped-line"><span class="crayon-v">cvModel</span> <span class="crayon-o">=</span> <span class="crayon-v">cv</span><span class="crayon-sy">.</span><span class="crayon-e">fit</span><span class="crayon-sy">(</span><span class="crayon-v">train_data</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b73386c282038553-11" class="crayon-line"><span class="crayon-c"># This will likely take a fair amount of time</span></div> <div id="urvanov-syntax-highlighter-610ff0b73386c282038553-12" class="crayon-line crayon-striped-line"><span class="crayon-v">end_time</span> <span class="crayon-o">=</span> <span class="crayon-k ">time</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b73386c282038553-13" class="crayon-line"><span class="crayon-v">elapsed_time</span> <span class="crayon-o">=</span> <span class="crayon-v">end_time</span> <span class="crayon-o">-</span> <span class="crayon-e">start_time</span></div> <div id="urvanov-syntax-highlighter-610ff0b73386c282038553-14" class="crayon-line crayon-striped-line"><span class="crayon-k ">print</span><span class="crayon-sy">(</span><span class="crayon-s">"Time to train model: %.3f seconds"</span> <span class="crayon-o">%</span> <span class="crayon-v">elapsed_time</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">Time to train model: 978.807 seconds</p> <p style="text-align: justify;">The best regularization hyperparameter is 0.01, with an accuracy of 85.316 percent.</p> <div id="urvanov-syntax-highlighter-610ff0b73386f560102431" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div
class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73386f560102431-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73386f560102431-2">2</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73386f560102431-1" class="crayon-line"><span class="crayon-e">accuracy_m</span><span class="crayon-sy">(</span><span class="crayon-v">model</span> <span class="crayon-o">=</span> <span class="crayon-v">cvModel</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b73386f560102431-2" class="crayon-line crayon-striped-line"><span class="crayon-c"># Output: Model accuracy: 85.316%</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">You can extract the recommended parameters by chaining cvModel.bestModel with extractParamMap().</p> <div id="urvanov-syntax-highlighter-610ff0b733872651012405" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num"
data-line="urvanov-syntax-highlighter-610ff0b733872651012405-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733872651012405-2">2</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733872651012405-1" class="crayon-line"><span class="crayon-v">bestModel</span> <span class="crayon-o">=</span> <span class="crayon-v">cvModel</span><span class="crayon-sy">.</span><span class="crayon-e">bestModel</span></div> <div id="urvanov-syntax-highlighter-610ff0b733872651012405-2" class="crayon-line crayon-striped-line"><span class="crayon-v">bestModel</span><span class="crayon-sy">.</span><span class="crayon-e">extractParamMap</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b733875920501297" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733875920501297-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733875920501297-2">2</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733875920501297-3">3</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733875920501297-4">4</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733875920501297-5">5</div> <div 
class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733875920501297-6">6</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733875920501297-7">7</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733875920501297-8">8</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733875920501297-9">9</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733875920501297-10">10</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733875920501297-11">11</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733875920501297-12">12</div> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733875920501297-13">13</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b733875920501297-14">14</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733875920501297-1" class="crayon-line"><span class="crayon-sy">{</span><span class="crayon-e">Param</span><span class="crayon-sy">(</span><span class="crayon-r">parent</span><span class="crayon-o">=</span><span class="crayon-s">'LogisticRegression_4d8f8ce4d6a02d8c29a0'</span><span class="crayon-sy">,</span> <span class="crayon-v">name</span><span class="crayon-o">=</span><span class="crayon-s">'aggregationDepth'</span><span class="crayon-sy">,</span> <span class="crayon-v">doc</span><span class="crayon-o">=</span><span class="crayon-s">'suggested depth for treeAggregate (&gt;= 2)'</span><span class="crayon-sy">)</span><span class="crayon-o">:</span> <span class="crayon-cn">2</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b733875920501297-2" class="crayon-line crayon-striped-line"><span class="crayon-e">Param</span><span class="crayon-sy">(</span><span 
class="crayon-r">parent</span><span class="crayon-o">=</span><span class="crayon-s">'LogisticRegression_4d8f8ce4d6a02d8c29a0'</span><span class="crayon-sy">,</span> <span class="crayon-v">name</span><span class="crayon-o">=</span><span class="crayon-s">'elasticNetParam'</span><span class="crayon-sy">,</span> <span class="crayon-v">doc</span><span class="crayon-o">=</span><span class="crayon-s">'the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty'</span><span class="crayon-sy">)</span><span class="crayon-o">:</span> <span class="crayon-cn">0.0</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b733875920501297-3" class="crayon-line"><span class="crayon-e">Param</span><span class="crayon-sy">(</span><span class="crayon-r">parent</span><span class="crayon-o">=</span><span class="crayon-s">'LogisticRegression_4d8f8ce4d6a02d8c29a0'</span><span class="crayon-sy">,</span> <span class="crayon-v">name</span><span class="crayon-o">=</span><span class="crayon-s">'family'</span><span class="crayon-sy">,</span> <span class="crayon-v">doc</span><span class="crayon-o">=</span><span class="crayon-s">'The name of family which is a description of the label distribution to be used in the model. 
Supported options: auto, binomial, multinomial.'</span><span class="crayon-sy">)</span><span class="crayon-o">:</span> <span class="crayon-s">'auto'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b733875920501297-4" class="crayon-line crayon-striped-line"><span class="crayon-e">Param</span><span class="crayon-sy">(</span><span class="crayon-r">parent</span><span class="crayon-o">=</span><span class="crayon-s">'LogisticRegression_4d8f8ce4d6a02d8c29a0'</span><span class="crayon-sy">,</span> <span class="crayon-v">name</span><span class="crayon-o">=</span><span class="crayon-s">'featuresCol'</span><span class="crayon-sy">,</span> <span class="crayon-v">doc</span><span class="crayon-o">=</span><span class="crayon-s">'features column name'</span><span class="crayon-sy">)</span><span class="crayon-o">:</span> <span class="crayon-s">'features'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b733875920501297-5" class="crayon-line"><span class="crayon-e">Param</span><span class="crayon-sy">(</span><span class="crayon-r">parent</span><span class="crayon-o">=</span><span class="crayon-s">'LogisticRegression_4d8f8ce4d6a02d8c29a0'</span><span class="crayon-sy">,</span> <span class="crayon-v">name</span><span class="crayon-o">=</span><span class="crayon-s">'fitIntercept'</span><span class="crayon-sy">,</span> <span class="crayon-v">doc</span><span class="crayon-o">=</span><span class="crayon-s">'whether to fit an intercept term'</span><span class="crayon-sy">)</span><span class="crayon-o">:</span> <span class="crayon-t">True</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b733875920501297-6" class="crayon-line crayon-striped-line"><span class="crayon-e">Param</span><span class="crayon-sy">(</span><span class="crayon-r">parent</span><span class="crayon-o">=</span><span class="crayon-s">'LogisticRegression_4d8f8ce4d6a02d8c29a0'</span><span 
class="crayon-sy">,</span> <span class="crayon-v">name</span><span class="crayon-o">=</span><span class="crayon-s">'labelCol'</span><span class="crayon-sy">,</span> <span class="crayon-v">doc</span><span class="crayon-o">=</span><span class="crayon-s">'label column name'</span><span class="crayon-sy">)</span><span class="crayon-o">:</span> <span class="crayon-s">'label'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b733875920501297-7" class="crayon-line"><span class="crayon-e">Param</span><span class="crayon-sy">(</span><span class="crayon-r">parent</span><span class="crayon-o">=</span><span class="crayon-s">'LogisticRegression_4d8f8ce4d6a02d8c29a0'</span><span class="crayon-sy">,</span> <span class="crayon-v">name</span><span class="crayon-o">=</span><span class="crayon-s">'maxIter'</span><span class="crayon-sy">,</span> <span class="crayon-v">doc</span><span class="crayon-o">=</span><span class="crayon-s">'maximum number of iterations (&gt;= 0)'</span><span class="crayon-sy">)</span><span class="crayon-o">:</span> <span class="crayon-cn">10</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b733875920501297-8" class="crayon-line crayon-striped-line"><span class="crayon-e">Param</span><span class="crayon-sy">(</span><span class="crayon-r">parent</span><span class="crayon-o">=</span><span class="crayon-s">'LogisticRegression_4d8f8ce4d6a02d8c29a0'</span><span class="crayon-sy">,</span> <span class="crayon-v">name</span><span class="crayon-o">=</span><span class="crayon-s">'predictionCol'</span><span class="crayon-sy">,</span> <span class="crayon-v">doc</span><span class="crayon-o">=</span><span class="crayon-s">'prediction column name'</span><span class="crayon-sy">)</span><span class="crayon-o">:</span> <span class="crayon-s">'prediction'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b733875920501297-9" class="crayon-line"><span 
class="crayon-e">Param</span><span class="crayon-sy">(</span><span class="crayon-r">parent</span><span class="crayon-o">=</span><span class="crayon-s">'LogisticRegression_4d8f8ce4d6a02d8c29a0'</span><span class="crayon-sy">,</span> <span class="crayon-v">name</span><span class="crayon-o">=</span><span class="crayon-s">'probabilityCol'</span><span class="crayon-sy">,</span> <span class="crayon-v">doc</span><span class="crayon-o">=</span><span class="crayon-s">'Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities'</span><span class="crayon-sy">)</span><span class="crayon-o">:</span> <span class="crayon-s">'probability'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b733875920501297-10" class="crayon-line crayon-striped-line"><span class="crayon-e">Param</span><span class="crayon-sy">(</span><span class="crayon-r">parent</span><span class="crayon-o">=</span><span class="crayon-s">'LogisticRegression_4d8f8ce4d6a02d8c29a0'</span><span class="crayon-sy">,</span> <span class="crayon-v">name</span><span class="crayon-o">=</span><span class="crayon-s">'rawPredictionCol'</span><span class="crayon-sy">,</span> <span class="crayon-v">doc</span><span class="crayon-o">=</span><span class="crayon-s">'raw prediction (a.k.a. 
confidence) column name'</span><span class="crayon-sy">)</span><span class="crayon-o">:</span> <span class="crayon-s">'rawPrediction'</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b733875920501297-11" class="crayon-line"><span class="crayon-e">Param</span><span class="crayon-sy">(</span><span class="crayon-r">parent</span><span class="crayon-o">=</span><span class="crayon-s">'LogisticRegression_4d8f8ce4d6a02d8c29a0'</span><span class="crayon-sy">,</span> <span class="crayon-v">name</span><span class="crayon-o">=</span><span class="crayon-s">'regParam'</span><span class="crayon-sy">,</span> <span class="crayon-v">doc</span><span class="crayon-o">=</span><span class="crayon-s">'regularization parameter (&gt;= 0)'</span><span class="crayon-sy">)</span><span class="crayon-o">:</span> <span class="crayon-cn">0.01</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b733875920501297-12" class="crayon-line crayon-striped-line"><span class="crayon-e">Param</span><span class="crayon-sy">(</span><span class="crayon-r">parent</span><span class="crayon-o">=</span><span class="crayon-s">'LogisticRegression_4d8f8ce4d6a02d8c29a0'</span><span class="crayon-sy">,</span> <span class="crayon-v">name</span><span class="crayon-o">=</span><span class="crayon-s">'standardization'</span><span class="crayon-sy">,</span> <span class="crayon-v">doc</span><span class="crayon-o">=</span><span class="crayon-s">'whether to standardize the training features before fitting the model'</span><span class="crayon-sy">)</span><span class="crayon-o">:</span> <span class="crayon-t">True</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b733875920501297-13" class="crayon-line"><span class="crayon-e">Param</span><span class="crayon-sy">(</span><span class="crayon-r">parent</span><span class="crayon-o">=</span><span class="crayon-s">'LogisticRegression_4d8f8ce4d6a02d8c29a0'</span><span 
class="crayon-sy">,</span> <span class="crayon-v">name</span><span class="crayon-o">=</span><span class="crayon-s">'threshold'</span><span class="crayon-sy">,</span> <span class="crayon-v">doc</span><span class="crayon-o">=</span><span class="crayon-s">'threshold in binary classification prediction, in range [0, 1]'</span><span class="crayon-sy">)</span><span class="crayon-o">:</span> <span class="crayon-cn">0.5</span><span class="crayon-sy">,</span></div> <div id="urvanov-syntax-highlighter-610ff0b733875920501297-14" class="crayon-line crayon-striped-line"><span class="crayon-e">Param</span><span class="crayon-sy">(</span><span class="crayon-r">parent</span><span class="crayon-o">=</span><span class="crayon-s">'LogisticRegression_4d8f8ce4d6a02d8c29a0'</span><span class="crayon-sy">,</span> <span class="crayon-v">name</span><span class="crayon-o">=</span><span class="crayon-s">'tol'</span><span class="crayon-sy">,</span> <span class="crayon-v">doc</span><span class="crayon-o">=</span><span class="crayon-s">'the convergence tolerance for iterative algorithms (&gt;= 0)'</span><span class="crayon-sy">)</span><span class="crayon-o">:</span> <span class="crayon-cn">1e</span><span class="crayon-o">-</span><span class="crayon-cn">06</span><span class="crayon-sy">}</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <h3 id="ftoc-heading-29" class="ftwp-heading" style="text-align: justify;">Conclusion</h3> <p style="text-align: justify;">Spark is a fundamental tool for a data scientist. 
It lets you connect an application to different data sources, run data analysis seamlessly, or add a predictive model.</p> <p style="text-align: justify;">To get started with Spark, you first need to initialize a Spark context with:</p> <p style="text-align: justify;"><code>SparkContext()</code></p> <p style="text-align: justify;">And a SQL context to connect to a data source:</p> <p style="text-align: justify;"><code>SQLContext()</code></p> <p style="text-align: justify;">In this article, we learned how to train a logistic regression model:</p> <p style="text-align: justify;">Convert the dataset to a DataFrame with:</p> <div id="urvanov-syntax-highlighter-610ff0b73387b496905016" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73387b496905016-1">1</div> <div class="crayon-num crayon-striped-num" data-line="urvanov-syntax-highlighter-610ff0b73387b496905016-2">2</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b73387b496905016-1" class="crayon-line"><span class="crayon-v">input_data</span> <span class="crayon-o">=</span> <span class="crayon-v">rdd</span><span class="crayon-sy">.</span><span class="crayon-k ">map</span><span class="crayon-sy">(</span><span class="crayon-r">lambda</span> <span class="crayon-v">x</span><span class="crayon-o">:</span> <span class="crayon-sy">(</span><span class="crayon-v">x</span><span 
class="crayon-sy">[</span><span class="crayon-s">"newlabel"</span><span class="crayon-sy">]</span><span class="crayon-sy">,</span> <span class="crayon-e">DenseVector</span><span class="crayon-sy">(</span><span class="crayon-v">x</span><span class="crayon-sy">[</span><span class="crayon-s">"features"</span><span class="crayon-sy">]</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span><span class="crayon-sy">)</span></div> <div id="urvanov-syntax-highlighter-610ff0b73387b496905016-2" class="crayon-line crayon-striped-line"><span class="crayon-v">sqlContext</span><span class="crayon-sy">.</span><span class="crayon-e">createDataFrame</span><span class="crayon-sy">(</span><span class="crayon-v">input_data</span><span class="crayon-sy">,</span> <span class="crayon-sy">[</span><span class="crayon-s">"label"</span><span class="crayon-sy">,</span> <span class="crayon-s">"features"</span><span class="crayon-sy">]</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">Create the train/test sets</p> <div id="urvanov-syntax-highlighter-610ff0b733881870888068" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733881870888068-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733881870888068-1" class="crayon-line"><span class="crayon-e">randomSplit</span><span 
class="crayon-sy">(</span><span class="crayon-sy">[</span><span class="crayon-sy">.</span><span class="crayon-cn">8</span><span class="crayon-sy">,</span><span class="crayon-sy">.</span><span class="crayon-cn">2</span><span class="crayon-sy">]</span><span class="crayon-sy">,</span><span class="crayon-v">seed</span><span class="crayon-o">=</span><span class="crayon-cn">1234</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">Train the model</p> <div id="urvanov-syntax-highlighter-610ff0b733885152432911" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733885152432911-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733885152432911-1" class="crayon-line"><span class="crayon-e">LogisticRegression</span><span class="crayon-sy">(</span><span class="crayon-v">labelCol</span><span class="crayon-o">=</span><span class="crayon-s">"label"</span><span class="crayon-sy">,</span><span class="crayon-v">featuresCol</span><span class="crayon-o">=</span><span class="crayon-s">"features"</span><span class="crayon-sy">,</span><span class="crayon-v">maxIter</span><span class="crayon-o">=</span><span class="crayon-cn">10</span><span class="crayon-sy">,</span> <span class="crayon-v">regParam</span><span class="crayon-o">=</span><span class="crayon-cn">0.3</span><span 
class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <div id="urvanov-syntax-highlighter-610ff0b733888563552114" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b733888563552114-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div id="urvanov-syntax-highlighter-610ff0b733888563552114-1" class="crayon-line"><span class="crayon-v">lr</span><span class="crayon-sy">.</span><span class="crayon-e">fit</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">Make predictions</p> <div id="urvanov-syntax-highlighter-610ff0b73388a636754323" class="urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate" style="text-align: justify;" data-settings=" minimize scroll-mouseover"> <div class="urvanov-syntax-highlighter-plain-wrap">&nbsp;</div> <div class="urvanov-syntax-highlighter-main"> <table class="crayon-table"> <tbody> <tr class="urvanov-syntax-highlighter-row"> <td class="crayon-nums " data-settings="show"> <div class="urvanov-syntax-highlighter-nums-content"> <div class="crayon-num" data-line="urvanov-syntax-highlighter-610ff0b73388a636754323-1">1</div> </div> </td> <td class="urvanov-syntax-highlighter-code"> <div class="crayon-pre"> <div 
id="urvanov-syntax-highlighter-610ff0b73388a636754323-1" class="crayon-line"><span class="crayon-v">linearModel</span><span class="crayon-sy">.</span><span class="crayon-e">transform</span><span class="crayon-sy">(</span><span class="crayon-sy">)</span></div> </div> </td> </tr> </tbody> </table> </div> </div> <p style="text-align: justify;">Next article:<a href="../../cach-dung-scikit-learn-tu-hoc-tensorflow/" target="_blank" rel="noopener">&nbsp;How to Use&nbsp;Scikit-Learn &ndash; Machine Learning with Python</a></p>
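<p style="text-align: justify;">As a rough illustration of the train/test step in the recap above, the sketch below mimics in plain Python what a weighted random split with a seed does conceptually: each row is assigned independently to one split with probability proportional to its weight, so the resulting sizes are only approximately 80/20. This is not Spark's implementation; the helper name <code>random_split</code> and the sample <code>rows</code> are assumptions made up for this example.</p>

```python
import random


def random_split(rows, weights, seed):
    """Hypothetical helper: assign each row to one split at random.

    Weights are normalized to sum to 1, so [.8, .2] and [8, 2] behave the same.
    """
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [.8, .2].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    splits = [[] for _ in weights]
    last = len(bounds) - 1
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < upper or i == last:
                splits[i].append(row)
                break
    return splits


# Illustrative data only: 1000 integer "rows".
rows = list(range(1000))
train, test = random_split(rows, [.8, .2], seed=1234)
print(len(train), len(test))
```

<p style="text-align: justify;">Running it twice with the same seed gives the same two lists, which is why the recap passes <code>seed=1234</code>: it keeps the evaluation comparable between runs.</p>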