programing

확장 성을 고려할 때 조인이 나쁜 이유는 무엇입니까?

nasanasas 2020. 9. 7. 08:15
반응형

확장 성을 고려할 때 조인이 나쁜 이유는 무엇입니까?


조인이 나쁘거나 '느린'이유는 무엇입니까? 나는 이것을 한 번 더 들었다는 것을 안다. 이 인용구를 찾았습니다

문제는 조인이 상대적으로 느리고 특히 매우 큰 데이터 세트에서 느리고 웹 사이트가 느리다는 것입니다. 디스크에서 분리 된 모든 정보를 가져 와서 모두 다시 모으려면 오랜 시간이 걸립니다.

출처

나는 항상 그들이 특히 PK를 찾을 때 빠르다고 생각했습니다. 왜 '느리다'고?


확장 성은 작업 단위당 리소스 사용을 최소화하기 위해 반복되는 작업을 사전 컴퓨팅, 분산 또는 기본 요소로 축소하는 것입니다. 확장을 잘하기 위해 필요하지 않은 작업을 수행하지 않고 실제로 수행하는 작업을 최대한 효율적으로 수행합니다.

물론 이러한 맥락에서 두 개의 개별 데이터 소스를 결합하는 것은 적어도 결합하지 않는 것에 비해 상대적으로 느립니다. 사용자가 요청하는 지점에서 실시간으로 수행해야하는 작업이기 때문입니다.

그러나 대안은 더 이상 두 개의 개별 데이터를 전혀 가지지 않는다는 것을 기억하십시오. 두 개의 서로 다른 데이터 포인트를 동일한 레코드에 넣어야합니다. 어딘가에 결과없이 두 개의 서로 다른 데이터를 결합 할 수 없으므로 장단점을 이해해야합니다.

좋은 소식은 현대의 관계형 데이터베이스가 조인에 능하다 입니다. 좋은 데이터베이스를 잘 사용하면 조인이 느리다고 생각해서는 안됩니다. 데이터베이스는 원시 조인을 가져와 훨씬 더 빠르게 만드는 다양한 확장 성 친화적 인 방법을 제공합니다 .

  • 자연 키가 아닌 대리 키 (자동 번호 / ID 열)에서 조인합니다. 이는 조인 작업 중에 더 작은 (따라서 더 빠른) 비교를 의미합니다.
  • 인덱스
  • 구체화 된 / 인덱싱 된 뷰 (사전 계산 된 조인 또는 관리 된 비정규 화로 생각)
  • 계산 된 열. 이를 사용하여 조인의 키 열을 해시하거나 미리 계산할 수 있으므로 조인에 대한 복잡한 비교가 훨씬 더 작아지고 잠재적으로 사전 인덱싱 될 수 있습니다.
  • 테이블 파티션 (로드를 여러 디스크로 분산하거나 테이블 스캔 일 수있는 것을 파티션 스캔으로 제한하여 대용량 데이터 세트에 도움이 됨)
  • OLAP (특정 종류의 쿼리 / 조인 결과를 미리 계산합니다. 사실은 아니지만 일반적인 비정규 화 라고 생각할 수 있습니다. )
  • 복제, 가용성 그룹, 로그 전달 또는 기타 메커니즘을 통해 여러 서버가 동일한 데이터베이스에 대한 읽기 쿼리에 응답 할 수 있으므로 여러 서버간에 워크로드를 확장 할 수 있습니다.

나는 지금까지 말을 갈 것 관계형 데이터베이스 모두에 존재하는 주된 이유는 당신이 할 수 있도록하는 것입니다 효율적으로 결합 * . 확실히 구조화 된 데이터를 저장하는 것만이 아닙니다 (csv 또는 xml과 같은 플랫 파일 구조를 사용하여 수행 할 수 있음). 내가 나열한 몇 가지 옵션을 사용하면 사전에 조인을 완전히 구축 할 수 있으므로 쿼리를 실행하기 전에 결과가 이미 완료되었습니다. 마치 데이터를 비정규 화 한 것처럼 (당연히 쓰기 작업 속도가 느려집니다).

조인이 느린 경우 데이터베이스를 올바르게 사용하고 있지 않을 수 있습니다.

비정규 화는 이러한 다른 기술이 실패한 후에 만 ​​수행되어야합니다. "실패"를 진정으로 판단 할 수있는 유일한 방법은 의미있는 성능 목표를 설정하고 해당 목표에 대해 측정하는 것입니다. 측정하지 않았다면 비정규 화에 대해 생각조차하기에는 너무 이르다.

* 즉, 단순한 테이블 모음과는 다른 엔티티로 존재합니다. 실제 rdbms의 또 다른 이유는 안전한 동시 액세스입니다.


조인은 비정규 화를 통해 피하는 것보다 느릴있지만 올바르게 사용하면 (적절한 인덱스가있는 열에 조인하는 등) 본질적으로 느리지 않습니다 .

비정규 화는 잘 설계된 데이터베이스 스키마가 성능 문제를 나타내는 경우 고려할 수있는 많은 최적화 기술 중 하나입니다.


기사에 따르면 조인이없는 경우에 비해 느립니다. 이것은 비정규 화로 달성 할 수 있습니다. 그래서 속도와 정규화 사이에는 상충 관계가 있습니다. 조기 최적화도 잊지 마세요 :)


우선, 관계형 데이터베이스의 존재 이유 (존재 이유)는 엔티티 간의 관계를 모델링 할 수 있다는 것입니다. 조인은 단순히 이러한 관계를 탐색하는 메커니즘입니다. 확실히 명목상의 비용이 들지만 조인이 없으면 관계형 데이터베이스를 가질 이유가 없습니다.

In the academic world we learn of things like the various normal forms (1st, 2nd, 3rd, Boyce-Codd, etc.), and we learn about different types of keys (primary, foreign, alternate, unique, etc.) and how these things fit together to design a database. And we learn the rudiments of SQL as well as manipulating both structure and data (DDL & DML).

In the corporate world, many of the academic constructs turn out to be substantially less viable than we had been led to believe. A perfect example is the notion of a primary key. Academically it is that attribute (or collection of attributes) that uniquely identifies one row in the table. So in many problem domains, the proper academic primary key is a composite of 3 or 4 attributes. However, almost everyone in the modern corporate world uses an auto-generated, sequential integer as a table's primary key. Why? Two reasons. The first is because it makes the model much cleaner when you're migrating FKs all over the place. The second, and most germane to this question, is that retrieving data through joins is faster and more efficient on a single integer than it is on 4 varchar columns (as already mentioned by a few folks).

Let's dig a little deeper now into two specific subtypes of real world databases. The first type is a transactional database. This is the basis for many e-commerce or content management applications driving modern sites. With a transaction DB, you're optimizing heavily toward "transaction throughput". Most commerce or content apps have to balance query performance (from certain tables) with insert performance (in other tables), though each app will have its own unique business driven issues to solve.

The second type of real world database is a reporting database. These are used almost exclusively to aggregate business data and to generate meaningful business reports. They are typically shaped differently than the transaction databases where the data is generated and they are highly optimized for speed of bulk data loading (ETLs) and query performance with large or complex data sets.

In each case, the developer or DBA needs to carefully balance both the functionality and performance curves, and there are lots of performance enhancing tricks on both sides of the equation. In Oracle you can do what's called an "explain plan" so you can see specifically how a query gets parsed and executed. You're looking to maximize the DB's proper use of indexes. One really nasty no-no is to put a function in the where clause of a query. Whenever you do that, you guarantee that Oracle will not use any indexes on that particular column and you'll likely see a full or partial table scan in the explain plan. That's just one specific example of how a query could be written that ends up being slow, and it doesn't have anything to do with joins.

And while we're talking about table scans, they obviously impact the query speed proportionally to the size of the table. A full table scan of 100 rows isn't even noticeable. Run that same query on a table with 100 million rows, and you'll need to come back next week for the return.

Let's talk about normalization for a minute. This is another largely positive academic topic that can get over-stressed. Most of the time when we talk about normalization we really mean the elimination of duplicate data by putting it into its own table and migrating an FK. Folks usually skip over the whole dependence thing described by 2NF and 3NF. And yet in an extreme case, it's certainly possible to have a perfect BCNF database that's enormous and a complete beast to write code against because it's so normalized.

So where do we balance? There is no single best answer. All of the better answers tend to be some compromise between ease of structure maintenance, ease of data maintenance and ease of code creation/maintenance. In general, the less duplication of data, the better.

So why are joins sometimes slow? Sometimes it's bad relational design. Sometimes it's ineffective indexing. Sometimes it's a data volume issue. Sometimes it's a horribly written query.

Sorry for such a long-winded answer, but I felt compelled to provide a meatier context around my comments rather than just rattle off a 4-bullet response.


People with terrabyte sized databases still use joins, if they can get them to work performance-wise then so can you.

There are many reasons not to denomalize. First, speed of select queries is not the only or even main concern with databases. Integrity of the data is the first concern. If you denormalize then you have to put into place techniques to keep the data denormalized as the parent data changes. So suppose you take to storing the client name in all tables instead of joining to the client table on the client_Id. Now when the name of the client changes (100% chance some of the names of clients will change over time), now you need to update all the child records to reflect that change. If you do this wil a cascade update and you have a million child records, how fast do you suppose that is going to be and how many users are going to suffer locking issues and delays in their work while it happens? Further most people who denormalize because "joins are slow" don't know enough about databases to properly make sure their data integrity is protected and often end up with databases that have unuseable data becasue the integrity is so bad.

Denormalization is a complex process that requires an thorough understanding of database performance and integrity if it is to be done correctly. Do not attempt to denormalize unless you have such expertise on staff.

Joins are quite fast enough if you do several things. First use a suggorgate key, an int join is almost alawys the fastest join. Second always index the foreign key. Use derived tables or join conditions to create a smaller dataset to filter on. If you have a large very complex database, then hire a professional database person with experience in partioning and managing huge databases. There are plenty of techniques to improve performance without getting rid of joins.

If you just need query capability, then yes you can design a datawarehouse which can be denormalized and is populated through an ETL tool (optimized for speed) not user data entry.


Joins are slow if

  • the data is improperly indexed
  • results poorly filtered
  • joining query poorly written
  • data sets very large and complex

So, true, the bigger your data sets the the more processing you'll need for a query but checking and working on the first three options of the above will often yield great results.

Your source gives denormalization as an option. This is fine only as long as you've exhausted better alternatives.


The joins can be slow if large portions of records from each side need to be scanned.

Like this:

SELECT  SUM(transaction)
FROM    customers
JOIN    accounts
ON      account_customer = customer_id

Even if an index is defined on account_customer, all records from the latter still need to be scanned.

For the query list this, the decent optimizers won't probably even consider the index access path, doing a HASH JOIN or a MERGE JOIN instead.

Note that for a query like this:

SELECT  SUM(transaction)
FROM    customers
JOIN    accounts
ON      account_customer = customer_id
WHERE   customer_last_name = 'Stellphlug'

the join will most probably will be fast: first, an index on customer_last_name will be used to filter all Stellphlug's (which are of course, not very numerous), then an index scan on account_customer will be issued for each Stellphlug to find his transactions.

Despite the fact that these can be billions of records in accounts and customers, only few will actually need to be scanned.


Joins do require extra processing since they have to look in more files and more indexes to "join" the data together. However, "very large data sets" is all relative. What is the definition of large? I the case of JOINs, I think its a reference to a large result set, not that overall dataset.

Most databases can very quickly process a query that selects 5 records from a primary table and joins 5 records from a related table for each record (assuming the correct indexes are in place). These tables can have hundreds of millions of records each, or even billions.

Once your result set starts growing, things are going to slow down. Using the same example, if the primary table results in 100K records, then there will be 500K "joined" records that need to be found. Just pulling that much data out of the database with add delays.

Don't avoid JOINs, just know you may need to optimize/denormalize when datasets get "very large".


Joins are fast. Joins should be considered standard practice with a properly normalized database schema. Joins allow you to join disparate groups of data in a meaningful way. Don't fear the join.

The caveat is that you must understand normalization, joining, and the proper use of indexes.

Beware premature optimization, as the number one failing of all development projects is meeting the deadline. Once you've completed the project, and you understand the trade offs, you can break the rules if you can justify it.

It's true that join performance degrades non-linearly as the size of the data set increases. Therefore, it doesn't scale as nicely as single table queries, but it still does scale.

It's also true that a bird flies faster without any wings, but only straight down.


Also from the article you cited:

Many mega-scale websites with billions of records, petabytes of data, many thousands of simultaneous users, and millions of queries a day are doing is using a sharding scheme and some are even advocating denormalization as the best strategy for architecting the data tier.

and

And unless you are a really large website you probably don't need to worry about this level of complexity.

and

It's more error prone than having the database do all this work, but you are able to do scale past what even the highest end databases can handle.

The article is discussing mega-sites like Ebay. At that level of usage you are likely going to have to consider something other than plain vanilla relational database management. But in the "normal" course of business (applications with thousands of users and millions of records) those more expensive, more error prone approaches are overkill.


Joins are considered an opposing force to scalability because they're typically the bottleneck and they cannot be easily distributed or paralleled.


Properly designed tables containing with the proper indicies and correctly written queries not always slow. Where ever you heard that:

Why are joins bad or 'slow'

has no idea what they are talking about!!! Most joins will be very fast. If you have to join many many rows at one time you might take a hit as compared to a denormalized table, but that goes back to Properly designed tables, know when to denormalize and when not to. in a heavy reporting system, break out the data in denormalized tables for reports, or even create a data warehouse. In a transactional heavy system normalize the tables.


The amount of temporary data that is generated could be huge based on the joins.

For an example, one database here at work had a generic search function where all of the fields were optional. The search routine did a join on every table before the search began. This worked well in the beginning. But, now that the main table has over 10 million rows... not so much. Searches now take 30 minutes or more.

I was tasked with optimizing the search stored procedure.

The first thing I did was if any of the fields of the main table were being searched, I did a select to a temp table on those fields only. THEN, I joined all the tables with that temp table before doing the rest of the search. Searches where one of the main table fields now take less than 10 seconds.

If none of the main table fields are begin searched, I do similar optimizations for other tables. When I was done, no search takes longer than 30 seconds with most under 10.

CPU utilization of the SQL server also went WAY DOWN.


While joins (presumably due to a normalized design) can obviously be slower for data retrieval than a read from a single table, a denormalized database can be slow for data creation/update operations since the footprint of the overall transaction will not be minimal.

In a normalized database, a piece of data will live in only one place, so the footprint for an update will be as minimal as possible. In a denormalized database, it's possible that the same column in multiple rows or across tables will have to be updated, meaning the footprint would be larger and chance of locks and deadlocks can increase.


Well, yeah, selecting rows from one denormalized table (assuming decent indexes for your query) might be faster that selecting rows constructed from joining several tables, particularly if the joins don't have efficient indexes available.

The examples cited in the article - Flickr and eBay - are exceptional cases IMO, so have (and deserve) exceptional responses. The author specifically calls out the lack of RI and the extent of data duplication in the article.

Most applications - again, IMO - benefit from the validation & reduced duplication provided by RDBMSs.


They can be slow if done sloppily. For example, if you do a 'select *' on a join you will probaby take a while to get stuff back. However, if you carefully choose what columns to return from each table, and with the proper indexes in place, there should be no problem.

참고URL : https://stackoverflow.com/questions/2623852/why-are-joins-bad-when-considering-scalability

반응형